Problem Statement:¶

This problem statement is based on the Shinkansen Bullet Train in Japan and passengers’ experience with that mode of travel. This machine-learning exercise aims to determine the relative importance of each parameter with regard to its contribution to the passengers’ overall travel experience. The dataset contains a random sample of individuals who travelled on this train. The on-time performance of the trains, along with passenger information, is published in a file named ‘Traveldata_train.csv’. These passengers were later asked to provide feedback on various parameters related to the travel, along with their overall experience. The collected details are made available in the survey report labelled ‘Surveydata_train.csv’.

In the survey, each passenger was explicitly asked whether they were satisfied with their overall travel experience or not, and that is captured in the data of the survey report under the variable labelled ‘Overall_Experience’.

The objective of this problem is to understand which parameters play an important role in swaying passenger feedback towards a positive scale. You are provided test data containing the travel data and the survey data of passengers. Both the test data and the train data are collected at the same time and belong to the same population.


Data Dictionary:¶

ID - The unique ID of the passenger

Gender - The gender of the passenger

Customer_Type - Loyalty type of the passenger

Age - The age of the passenger

Type_Travel - Purpose of travel for the passenger

Travel_Class - The train class that the passenger traveled in

Travel_Distance - The distance traveled by the passenger

Departure_Delay_in_Mins - The delay (in minutes) in train departure

Arrival_Delay_in_Mins - The delay (in minutes) in train arrival

ID - The unique ID of the passenger

Platform_Location - How convenient the location of the platform is for the passenger

Seat_Class - The type of the seat class in the train. Green Car seats are usually more spacious and comfortable than ordinary seats. On the Shinkansen train, there are only four seats per row in the Green Car, versus five in the ordinary car.

Overall_Experience - The overall experience of the passenger

Seat_Comfort - The comfort level of the seat for the passenger

Arrival_Time_Convenient - How convenient the arrival time of the train is for the passenger

Catering - How convenient the catering service is for the passenger

Onboard_Wifi_Service - The quality of the onboard Wi-Fi service for the passenger

Onboard_Entertainment - The quality of the onboard entertainment for the passenger

Online_Support - The quality of the online support for the passenger

Ease_of_Online_Booking - The ease of online booking for the passenger

Onboard_Service - The quality of the onboard service for the passenger

Legroom - Legroom is the general term used in place of the more accurate “seat pitch”, which is the distance between a point on one seat and the same point on the seat in front of it. This variable describes the convenience of the legroom provided for the passenger

Baggage_Handling - The convenience of baggage handling for the passenger

CheckIn_Service - The convenience of the check-in service for the passenger

Cleanliness - The passenger's view of the cleanliness of the service

Online_Boarding - The convenience of the online boarding process for the passenger

Importing Libraries and Dataset¶

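Pinning Keras to version 2.12.0. As the output below shows, this replaces the preinstalled Keras 2.15.0 and conflicts with TensorFlow 2.15.0, which requires keras>=2.15.0,<2.16.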

In [ ]:
!pip install keras==2.12.0
Collecting keras==2.12.0
  Downloading keras-2.12.0-py2.py3-none-any.whl (1.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.7/1.7 MB 10.0 MB/s eta 0:00:00
Installing collected packages: keras
  Attempting uninstall: keras
    Found existing installation: keras 2.15.0
    Uninstalling keras-2.15.0:
      Successfully uninstalled keras-2.15.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow 2.15.0 requires keras<2.16,>=2.15.0, but you have keras 2.12.0 which is incompatible.
Successfully installed keras-2.12.0
In [ ]:
!pip install tensorflow-addons
Collecting tensorflow-addons
  Downloading tensorflow_addons-0.23.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (611 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 611.8/611.8 kB 6.7 MB/s eta 0:00:00
Requirement already satisfied: packaging in /usr/local/lib/python3.10/dist-packages (from tensorflow-addons) (23.2)
Collecting typeguard<3.0.0,>=2.7 (from tensorflow-addons)
  Downloading typeguard-2.13.3-py3-none-any.whl (17 kB)
Installing collected packages: typeguard, tensorflow-addons
Successfully installed tensorflow-addons-0.23.0 typeguard-2.13.3

Import Libraries

In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Suppress warnings from displaying to console
import warnings
warnings.filterwarnings("ignore")

# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)

# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)

# To build models for prediction
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.svm import SVC
from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge, Lasso, ElasticNet
from sklearn.tree import DecisionTreeClassifier #DecisionTreeClassifier predicts a categorical target. Its input features must be numeric, which holds once the categorical features are encoded (e.g. with get_dummies)
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
from sklearn.metrics import confusion_matrix, classification_report, precision_recall_curve, recall_score, precision_score, f1_score, accuracy_score
from sklearn import tree

# To encode categorical variables
from sklearn.preprocessing import LabelEncoder

# For tuning the model
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# To check model performance
from sklearn.metrics import make_scorer,mean_squared_error, r2_score, mean_absolute_error

#Import tensorflow for deep learning
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Input, Dropout,BatchNormalization
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier
import random
from tensorflow.keras import backend
In [ ]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)
Mounted at /content/drive

Import Datasets

In [ ]:
#Change the path locations accordingly
surveydata_test = pd.read_csv('/content/drive/MyDrive/GreatLearning/Hackathon/Datasets/Surveydata_test.csv')
surveydata_train = pd.read_csv('/content/drive/MyDrive/GreatLearning/Hackathon/Datasets/Surveydata_train.csv')
traveldata_test = pd.read_csv('/content/drive/MyDrive/GreatLearning/Hackathon/Datasets/Traveldata_test.csv')
traveldata_train = pd.read_csv('/content/drive/MyDrive/GreatLearning/Hackathon/Datasets/Traveldata_train.csv')
sample_submission = pd.read_csv('/content/drive/MyDrive/GreatLearning/Hackathon/Datasets/Sample_Submission.csv')
data_dictionary = pd.read_excel('/content/drive/MyDrive/GreatLearning/Hackathon/Datasets/Data_Dictionary.xlsx')

Making copies of each dataset

In [ ]:
#Keep untouched copies in case the working dataframes are accidentally modified or deleted (better safe than sorry)
surveydata_test_copy = surveydata_test.copy()
surveydata_train_copy = surveydata_train.copy()
traveldata_test_copy = traveldata_test.copy()
traveldata_train_copy = traveldata_train.copy()
sample_sumission_copy = sample_submission.copy()

Understanding the Data¶

In [ ]:
#This function will study the data's shape, datatype, number of null data and number of duplicate data.
def studydata(df):
    print("Shape:")
    print(df.shape)
    print("\nInfo:")
    print(df.info())
    print("\nNull:")
    print(df.isnull().sum())
    print("\nDuplicates:")
    print(df.duplicated().sum())

    # print("\nHead:")
    # print(df.head().T)
    # print("\nTail:")
    # print(df.tail().T)

Survey_Train¶

In [ ]:
studydata(surveydata_train)
Shape:
(94379, 17)

Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 94379 entries, 0 to 94378
Data columns (total 17 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   ID                       94379 non-null  int64 
 1   Overall_Experience       94379 non-null  int64 
 2   Seat_Comfort             94318 non-null  object
 3   Seat_Class               94379 non-null  object
 4   Arrival_Time_Convenient  85449 non-null  object
 5   Catering                 85638 non-null  object
 6   Platform_Location        94349 non-null  object
 7   Onboard_Wifi_Service     94349 non-null  object
 8   Onboard_Entertainment    94361 non-null  object
 9   Online_Support           94288 non-null  object
 10  Ease_of_Online_Booking   94306 non-null  object
 11  Onboard_Service          86778 non-null  object
 12  Legroom                  94289 non-null  object
 13  Baggage_Handling         94237 non-null  object
 14  CheckIn_Service          94302 non-null  object
 15  Cleanliness              94373 non-null  object
 16  Online_Boarding          94373 non-null  object
dtypes: int64(2), object(15)
memory usage: 12.2+ MB
None

Null:
ID                            0
Overall_Experience            0
Seat_Comfort                 61
Seat_Class                    0
Arrival_Time_Convenient    8930
Catering                   8741
Platform_Location            30
Onboard_Wifi_Service         30
Onboard_Entertainment        18
Online_Support               91
Ease_of_Online_Booking       73
Onboard_Service            7601
Legroom                      90
Baggage_Handling            142
CheckIn_Service              77
Cleanliness                   6
Online_Boarding               6
dtype: int64

Duplicates:
0
In [ ]:
#Transposed for easier view
surveydata_train.head().T
Out[ ]:
0 1 2 3 4
ID 98800001 98800002 98800003 98800004 98800005
Overall_Experience 0 0 1 0 1
Seat_Comfort Needs Improvement Poor Needs Improvement Acceptable Acceptable
Seat_Class Green Car Ordinary Green Car Ordinary Ordinary
Arrival_Time_Convenient Excellent Excellent Needs Improvement Needs Improvement Acceptable
Catering Excellent Poor Needs Improvement NaN Acceptable
Platform_Location Very Convenient Needs Improvement Needs Improvement Needs Improvement Manageable
Onboard_Wifi_Service Good Good Needs Improvement Acceptable Needs Improvement
Onboard_Entertainment Needs Improvement Poor Good Needs Improvement Good
Online_Support Acceptable Good Excellent Acceptable Excellent
Ease_of_Online_Booking Needs Improvement Good Excellent Acceptable Good
Onboard_Service Needs Improvement Excellent Excellent Acceptable Good
Legroom Acceptable Needs Improvement Excellent Acceptable Good
Baggage_Handling Needs Improvement Poor Excellent Acceptable Good
CheckIn_Service Good Needs Improvement Good Good Good
Cleanliness Needs Improvement Good Excellent Acceptable Good
Online_Boarding Poor Good Excellent Acceptable Good
In [ ]:
surveydata_train.tail().T
Out[ ]:
94374 94375 94376 94377 94378
ID 98894375 98894376 98894377 98894378 98894379
Overall_Experience 0 1 1 0 0
Seat_Comfort Poor Good Needs Improvement Needs Improvement Acceptable
Seat_Class Ordinary Ordinary Green Car Ordinary Ordinary
Arrival_Time_Convenient Good Good Needs Improvement NaN Poor
Catering Good Good Needs Improvement Needs Improvement Acceptable
Platform_Location Convenient Convenient Needs Improvement Convenient Manageable
Onboard_Wifi_Service Poor Needs Improvement Good Good Acceptable
Onboard_Entertainment Poor Excellent Excellent Needs Improvement Acceptable
Online_Support Poor Excellent Good Good Acceptable
Ease_of_Online_Booking Poor Acceptable Good Good Acceptable
Onboard_Service Good Acceptable Good Acceptable Poor
Legroom Good Acceptable Good Good Good
Baggage_Handling Good Acceptable Good Good Good
CheckIn_Service Needs Improvement Good Acceptable Good Poor
Cleanliness Good Acceptable Good Excellent Good
Online_Boarding Poor Good Acceptable Good Acceptable
In [ ]:
#Finding the fraction of null values in each column (multiply by 100 for a percentage)
surveydata_train.isnull().sum()/len(surveydata_train)
Out[ ]:
ID                         0.000000
Overall_Experience         0.000000
Seat_Comfort               0.000646
Seat_Class                 0.000000
Arrival_Time_Convenient    0.094619
Catering                   0.092616
Platform_Location          0.000318
Onboard_Wifi_Service       0.000318
Onboard_Entertainment      0.000191
Online_Support             0.000964
Ease_of_Online_Booking     0.000773
Onboard_Service            0.080537
Legroom                    0.000954
Baggage_Handling           0.001505
CheckIn_Service            0.000816
Cleanliness                0.000064
Online_Boarding            0.000064
dtype: float64

Observations:

  1. 94379 entries
  2. ID and Overall_Experience are integers; the rest are objects
  3. Overall_Experience is binary
  4. Most of the columns have missing values; only ID, Overall_Experience and Seat_Class do not
  5. Most of the columns are ordinal ratings stored as text and can be encoded as the integers 1 to 6
  6. Arrival_Time_Convenient, Onboard_Service and Catering each have roughly 8 to 9.5% missing values
  7. Data does not have duplicated entries
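Observation 5 can be sketched on a toy frame as follows. This is only an illustration, not the notebook's actual preprocessing: the direction of the mapping (1 = Extremely Poor, 6 = Excellent) and the name `rating_to_int` are assumptions.

```python
import pandas as pd

# Assumed mapping: a higher number means a better rating (the direction is a choice)
rating_to_int = {
    "Extremely Poor": 1,
    "Poor": 2,
    "Needs Improvement": 3,
    "Acceptable": 4,
    "Good": 5,
    "Excellent": 6,
}

# Toy stand-in for two survey columns, using levels seen in the data above
toy = pd.DataFrame({
    "Seat_Comfort": ["Needs Improvement", "Poor", "Acceptable", None],
    "Catering": ["Excellent", "Poor", None, "Good"],
})

# Series.map leaves missing values as NaN, so they can still be imputed later
encoded = toy.apply(lambda col: col.map(rating_to_int))
print(encoded)
```

Platform_Location uses a different set of labels (Very Convenient to Very Inconvenient), so it would need its own mapping.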

Survey_Test¶

In [ ]:
studydata(surveydata_test)
Shape:
(35602, 16)

Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35602 entries, 0 to 35601
Data columns (total 16 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   ID                       35602 non-null  int64 
 1   Seat_Comfort             35580 non-null  object
 2   Seat_Class               35602 non-null  object
 3   Arrival_Time_Convenient  32277 non-null  object
 4   Catering                 32245 non-null  object
 5   Platform_Location        35590 non-null  object
 6   Onboard_Wifi_Service     35590 non-null  object
 7   Onboard_Entertainment    35594 non-null  object
 8   Online_Support           35576 non-null  object
 9   Ease_of_Online_Booking   35584 non-null  object
 10  Onboard_Service          32730 non-null  object
 11  Legroom                  35577 non-null  object
 12  Baggage_Handling         35562 non-null  object
 13  CheckIn_Service          35580 non-null  object
 14  Cleanliness              35600 non-null  object
 15  Online_Boarding          35600 non-null  object
dtypes: int64(1), object(15)
memory usage: 4.3+ MB
None

Null:
ID                            0
Seat_Comfort                 22
Seat_Class                    0
Arrival_Time_Convenient    3325
Catering                   3357
Platform_Location            12
Onboard_Wifi_Service         12
Onboard_Entertainment         8
Online_Support               26
Ease_of_Online_Booking       18
Onboard_Service            2872
Legroom                      25
Baggage_Handling             40
CheckIn_Service              22
Cleanliness                   2
Online_Boarding               2
dtype: int64

Duplicates:
0
In [ ]:
surveydata_test.head().T
Out[ ]:
0 1 2 3 4
ID 99900001 99900002 99900003 99900004 99900005
Seat_Comfort Acceptable Extremely Poor Excellent Acceptable Excellent
Seat_Class Green Car Ordinary Ordinary Green Car Ordinary
Arrival_Time_Convenient Acceptable Good Excellent Excellent Extremely Poor
Catering Acceptable Poor Excellent Acceptable Excellent
Platform_Location Manageable Manageable Very Convenient Very Convenient Needs Improvement
Onboard_Wifi_Service Needs Improvement Acceptable Excellent Poor Excellent
Onboard_Entertainment Excellent Poor Excellent Acceptable Excellent
Online_Support Good Acceptable Excellent Excellent Excellent
Ease_of_Online_Booking Excellent Acceptable Needs Improvement Poor Excellent
Onboard_Service Excellent Excellent Needs Improvement Acceptable NaN
Legroom Excellent Acceptable Needs Improvement Needs Improvement Acceptable
Baggage_Handling Excellent Good Needs Improvement Excellent Excellent
CheckIn_Service Good Acceptable Good Excellent Excellent
Cleanliness Excellent Excellent Needs Improvement Excellent Excellent
Online_Boarding Poor Acceptable Excellent Poor Excellent
In [ ]:
surveydata_test.tail().T
Out[ ]:
35597 35598 35599 35600 35601
ID 99935598 99935599 99935600 99935601 99935602
Seat_Comfort Needs Improvement Needs Improvement Good Excellent Good
Seat_Class Green Car Ordinary Green Car Ordinary Ordinary
Arrival_Time_Convenient Excellent Needs Improvement Extremely Poor Excellent Acceptable
Catering Needs Improvement Good Good Excellent Good
Platform_Location Manageable Needs Improvement Needs Improvement Inconvenient Manageable
Onboard_Wifi_Service Acceptable Acceptable Needs Improvement Acceptable Poor
Onboard_Entertainment Needs Improvement Excellent Good Excellent Good
Online_Support Acceptable Excellent Poor Good Poor
Ease_of_Online_Booking Acceptable Good Needs Improvement Excellent Poor
Onboard_Service Good Good Poor Excellent Acceptable
Legroom Excellent Good Acceptable Excellent Good
Baggage_Handling Good Good Poor Excellent Good
CheckIn_Service Acceptable Acceptable Poor Acceptable Needs Improvement
Cleanliness Good Good Excellent Excellent Good
Online_Boarding Acceptable Good Needs Improvement Good Poor
In [ ]:
#Share of the test data in the combined (train + test) data
total = len(surveydata_test) + len(surveydata_train)
len(surveydata_test)/total
Out[ ]:
0.2739015702294951

Observations:

  1. 35602 Entries
  2. Does not have Overall_Experience, the target variable
  3. Most of the columns have missing values. Only ID and Seat_Class do not have missing values
  4. Arrival_Time_Convenient, Onboard_Service and Catering each have roughly 8 to 9.5% missing values
  5. Data does not have duplicated entries
  6. The train to test data ratio is around 73:27

Travel_Train¶

In [ ]:
studydata(traveldata_train)
Shape:
(94379, 9)

Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 94379 entries, 0 to 94378
Data columns (total 9 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   ID                       94379 non-null  int64  
 1   Gender                   94302 non-null  object 
 2   Customer_Type            85428 non-null  object 
 3   Age                      94346 non-null  float64
 4   Type_Travel              85153 non-null  object 
 5   Travel_Class             94379 non-null  object 
 6   Travel_Distance          94379 non-null  int64  
 7   Departure_Delay_in_Mins  94322 non-null  float64
 8   Arrival_Delay_in_Mins    94022 non-null  float64
dtypes: float64(3), int64(2), object(4)
memory usage: 6.5+ MB
None

Null:
ID                            0
Gender                       77
Customer_Type              8951
Age                          33
Type_Travel                9226
Travel_Class                  0
Travel_Distance               0
Departure_Delay_in_Mins      57
Arrival_Delay_in_Mins       357
dtype: int64

Duplicates:
0
In [ ]:
traveldata_train.head().T
Out[ ]:
0 1 2 3 4
ID 98800001 98800002 98800003 98800004 98800005
Gender Female Male Female Female Female
Customer_Type Loyal Customer Loyal Customer Loyal Customer Loyal Customer Loyal Customer
Age 52.0 48.0 43.0 44.0 50.0
Type_Travel NaN Personal Travel Business Travel Business Travel Business Travel
Travel_Class Business Eco Business Business Business
Travel_Distance 272 2200 1061 780 1981
Departure_Delay_in_Mins 0.0 9.0 77.0 13.0 0.0
Arrival_Delay_in_Mins 5.0 0.0 119.0 18.0 0.0
In [ ]:
traveldata_train.tail().T
Out[ ]:
94374 94375 94376 94377 94378
ID 98894375 98894376 98894377 98894378 98894379
Gender Male Male Male Male Male
Customer_Type Loyal Customer Loyal Customer NaN Loyal Customer Loyal Customer
Age 32.0 44.0 63.0 16.0 54.0
Type_Travel Business Travel Business Travel Business Travel Personal Travel NaN
Travel_Class Business Business Business Eco Eco
Travel_Distance 1357 592 2794 2744 2107
Departure_Delay_in_Mins 83.0 5.0 0.0 0.0 28.0
Arrival_Delay_in_Mins 125.0 11.0 0.0 0.0 28.0
In [ ]:
#Since travel data has many numerical variables, let's look at their summary statistics
traveldata_train.describe()
Out[ ]:
ID Age Travel_Distance Departure_Delay_in_Mins Arrival_Delay_in_Mins
count 9.437900e+04 94346.000000 94379.000000 94322.000000 94022.000000
mean 9.884719e+07 39.419647 1978.888185 14.647092 15.005222
std 2.724501e+04 15.116632 1027.961019 38.138781 38.439409
min 9.880000e+07 7.000000 50.000000 0.000000 0.000000
25% 9.882360e+07 27.000000 1359.000000 0.000000 0.000000
50% 9.884719e+07 40.000000 1923.000000 0.000000 0.000000
75% 9.887078e+07 51.000000 2538.000000 12.000000 13.000000
max 9.889438e+07 85.000000 6951.000000 1592.000000 1584.000000

Observations:

  1. Same number of entries as surveydata_train
  2. ID, Age, travel distance, delay in departure and arrival are numerical variables. The others are categorical
  3. Only ID, travel class and travel distance do not have missing values.
  4. Customer_Type and Type_Travel have almost 10% missing values
  5. There are also more missing values in arrival delays than in departure delays (357 versus 57)
  6. No duplicated values in dataset
  7. The average passenger is 39 years old with a standard deviation of 15 years. One standard deviation around the mean spans roughly 24 to 55 years and two standard deviations roughly 9 to 70 years; the quartiles show that the middle 50% of passengers are between 27 and 51 years old.
  8. The travel distance averages about 1979 km, from a minimum of 50 km to a maximum of 6951 km. We are mostly analysing data on long-distance rail journeys.
  9. The minimum travel distance of 50 km is very short, so its credibility may need to be confirmed
  10. Departure and arrival delays are heavily skewed to the right (the median is 0 while the maximum exceeds 1500 minutes), which means that delays do not happen often. But when they do, they can be large.
  11. Since arrival and departure delays have similar summary statistics, they may be feature engineered into a single variable
  12. Departure delay has a mean of about 15 minutes, with a very high standard deviation of 38 minutes. The minimum is 0 minutes and the maximum is 1592 minutes (about 26.5 hours), definitely an outlier.
  13. Arrival delay has a mean of about 15 minutes, with a very high standard deviation of 38 minutes. The minimum is 0 minutes and the maximum is 1584 minutes (about 26.4 hours), also definitely an outlier.
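The points about skewness and about merging the two delay columns can be checked numerically. The sketch below uses made-up delay values rather than the real data. `Total_Delay_in_Mins` is a hypothetical feature name, and summing the two delays is only one possible way to combine them.

```python
import pandas as pd

# Made-up delays: mostly zero with one very large value, like the summary above
toy = pd.DataFrame({
    "Departure_Delay_in_Mins": [0, 0, 0, 0, 5, 12, 0, 300],
    "Arrival_Delay_in_Mins": [0, 0, 0, 3, 0, 15, 0, 310],
})

# Positive skewness means a long tail to the right (rare but large delays)
skewness = toy.skew()
print(skewness)

# How strongly the two delays move together
correlation = toy["Departure_Delay_in_Mins"].corr(toy["Arrival_Delay_in_Mins"])
print(correlation)

# Hypothetical combined feature: total delay per passenger
toy["Total_Delay_in_Mins"] = toy["Departure_Delay_in_Mins"] + toy["Arrival_Delay_in_Mins"]
```

If the correlation on the real data is also close to 1, keeping both columns adds little information, which would support combining them.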

Travel_Test¶

In [ ]:
studydata(traveldata_test)
Shape:
(35602, 9)

Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35602 entries, 0 to 35601
Data columns (total 9 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   ID                       35602 non-null  int64  
 1   Gender                   35572 non-null  object 
 2   Customer_Type            32219 non-null  object 
 3   Age                      35591 non-null  float64
 4   Type_Travel              32154 non-null  object 
 5   Travel_Class             35602 non-null  object 
 6   Travel_Distance          35602 non-null  int64  
 7   Departure_Delay_in_Mins  35573 non-null  float64
 8   Arrival_Delay_in_Mins    35479 non-null  float64
dtypes: float64(3), int64(2), object(4)
memory usage: 2.4+ MB
None

Null:
ID                            0
Gender                       30
Customer_Type              3383
Age                          11
Type_Travel                3448
Travel_Class                  0
Travel_Distance               0
Departure_Delay_in_Mins      29
Arrival_Delay_in_Mins       123
dtype: int64

Duplicates:
0
In [ ]:
traveldata_test.head().T
Out[ ]:
0 1 2 3 4
ID 99900001 99900002 99900003 99900004 99900005
Gender Female Female Male Female Male
Customer_Type NaN Disloyal Customer Loyal Customer Loyal Customer Disloyal Customer
Age 36.0 21.0 60.0 29.0 18.0
Type_Travel Business Travel Business Travel Business Travel Personal Travel Business Travel
Travel_Class Business Business Business Eco Business
Travel_Distance 532 1425 2832 1352 1610
Departure_Delay_in_Mins 0.0 9.0 0.0 0.0 17.0
Arrival_Delay_in_Mins 0.0 28.0 0.0 0.0 0.0
In [ ]:
traveldata_test.tail().T
Out[ ]:
35597 35598 35599 35600 35601
ID 99935598 99935599 99935600 99935601 99935602
Gender Male Female Male Female Male
Customer_Type Loyal Customer Loyal Customer Disloyal Customer Loyal Customer NaN
Age 8.0 53.0 22.0 67.0 20.0
Type_Travel Personal Travel Business Travel Business Travel Personal Travel Personal Travel
Travel_Class Eco Business Eco Eco Eco
Travel_Distance 1334 1772 1180 420 1680
Departure_Delay_in_Mins 0.0 0.0 0.0 23.0 0.0
Arrival_Delay_in_Mins 0.0 0.0 0.0 16.0 0.0
In [ ]:
traveldata_test.describe()
Out[ ]:
ID Age Travel_Distance Departure_Delay_in_Mins Arrival_Delay_in_Mins
count 3.560200e+04 35591.000000 35602.000000 35573.000000 35479.000000
mean 9.991780e+07 39.446995 1987.151761 14.880696 15.308802
std 1.027756e+04 15.137554 1024.308863 37.895453 38.531293
min 9.990000e+07 7.000000 50.000000 0.000000 0.000000
25% 9.990890e+07 27.000000 1360.000000 0.000000 0.000000
50% 9.991780e+07 40.000000 1929.000000 0.000000 0.000000
75% 9.992670e+07 51.000000 2559.000000 13.000000 13.000000
max 9.993560e+07 85.000000 6868.000000 978.000000 970.000000

Observations:

  1. Same number of entries as survey test
  2. ID, Age, travel distance, delay in departure and arrival are numerical variables. The others are categorical
  3. Only ID, travel class and travel distance do not have missing values.
  4. Customer_Type and Type_Travel have almost 10% missing values
  5. No duplicated values in dataset
  6. Statistical distributions are quite similar to those of traveldata_train
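The similarity between the two sets can be made concrete by placing summary statistics side by side. A minimal sketch, using the means copied from the two `describe()` outputs above instead of the real frames; `relative_gap` is a hypothetical name.

```python
import pandas as pd

# Means copied from the describe() outputs of traveldata_train and traveldata_test
train_mean = pd.Series({
    "Age": 39.419647,
    "Travel_Distance": 1978.888185,
    "Departure_Delay_in_Mins": 14.647092,
    "Arrival_Delay_in_Mins": 15.005222,
})
test_mean = pd.Series({
    "Age": 39.446995,
    "Travel_Distance": 1987.151761,
    "Departure_Delay_in_Mins": 14.880696,
    "Arrival_Delay_in_Mins": 15.308802,
})

# Relative difference of the test mean from the train mean, in percent
relative_gap = (test_mean - train_mean) / train_mean * 100
comparison = pd.DataFrame({"train": train_mean, "test": test_mean, "gap_%": relative_gap.round(2)})
print(comparison)
```

The means differ by only a few percent. The maxima differ more (for example 1592 versus 978 minutes for departure delay), as expected for extreme values in a heavily skewed variable.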

Combine Training Data for EDA¶

In [ ]:
#Checking if the travel and survey training data have the same number of unique IDs
#n_passengers is defined outside the if-block because the next cell uses it
n_passengers = traveldata_train['ID'].nunique()
if n_passengers == surveydata_train['ID'].nunique():
    print("the unique ids are the same number")
    print(f"there are {n_passengers} passengers in total")
the unique ids are the same number
there are 94379 passengers in total
In [ ]:
#merge dataframes
train = pd.merge(traveldata_train,surveydata_train,how='inner',on='ID')
if n_passengers == train['ID'].nunique():
    print('merge is successful, all passengers are in the final dataframe')
merge is successful, all passengers are in the final dataframe
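Equal counts of unique IDs do not by themselves guarantee that both files describe the same passengers. A stricter check, sketched here on toy frames (`travel` and `survey` are placeholder names), compares the ID values directly and lets pandas validate that the merge is one-to-one.

```python
import pandas as pd

# Toy stand-ins for the travel and survey frames
travel = pd.DataFrame({"ID": [1, 2, 3], "Age": [52.0, 48.0, 43.0]})
survey = pd.DataFrame({"ID": [3, 1, 2], "Overall_Experience": [1, 0, 0]})

# Compare the ID values themselves, not only how many distinct IDs there are
same_ids = set(travel["ID"]) == set(survey["ID"])
print(same_ids)

# validate="one_to_one" makes pandas raise a MergeError if an ID repeats on either side
merged = pd.merge(travel, survey, how="inner", on="ID", validate="one_to_one")
print(merged)
```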
In [ ]:
studydata(train)
Shape:
(94379, 25)

Info:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 94379 entries, 0 to 94378
Data columns (total 25 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   ID                       94379 non-null  int64  
 1   Gender                   94302 non-null  object 
 2   Customer_Type            85428 non-null  object 
 3   Age                      94346 non-null  float64
 4   Type_Travel              85153 non-null  object 
 5   Travel_Class             94379 non-null  object 
 6   Travel_Distance          94379 non-null  int64  
 7   Departure_Delay_in_Mins  94322 non-null  float64
 8   Arrival_Delay_in_Mins    94022 non-null  float64
 9   Overall_Experience       94379 non-null  int64  
 10  Seat_Comfort             94318 non-null  object 
 11  Seat_Class               94379 non-null  object 
 12  Arrival_Time_Convenient  85449 non-null  object 
 13  Catering                 85638 non-null  object 
 14  Platform_Location        94349 non-null  object 
 15  Onboard_Wifi_Service     94349 non-null  object 
 16  Onboard_Entertainment    94361 non-null  object 
 17  Online_Support           94288 non-null  object 
 18  Ease_of_Online_Booking   94306 non-null  object 
 19  Onboard_Service          86778 non-null  object 
 20  Legroom                  94289 non-null  object 
 21  Baggage_Handling         94237 non-null  object 
 22  CheckIn_Service          94302 non-null  object 
 23  Cleanliness              94373 non-null  object 
 24  Online_Boarding          94373 non-null  object 
dtypes: float64(3), int64(3), object(19)
memory usage: 18.7+ MB
None

Null:
ID                            0
Gender                       77
Customer_Type              8951
Age                          33
Type_Travel                9226
Travel_Class                  0
Travel_Distance               0
Departure_Delay_in_Mins      57
Arrival_Delay_in_Mins       357
Overall_Experience            0
Seat_Comfort                 61
Seat_Class                    0
Arrival_Time_Convenient    8930
Catering                   8741
Platform_Location            30
Onboard_Wifi_Service         30
Onboard_Entertainment        18
Online_Support               91
Ease_of_Online_Booking       73
Onboard_Service            7601
Legroom                      90
Baggage_Handling            142
CheckIn_Service              77
Cleanliness                   6
Online_Boarding               6
dtype: int64

Duplicates:
0
In [ ]:
#Percentage of null values in each feature
train.isnull().sum()/train.shape[0]*100
Out[ ]:
ID                         0.000000
Gender                     0.081586
Customer_Type              9.484101
Age                        0.034965
Type_Travel                9.775480
Travel_Class               0.000000
Travel_Distance            0.000000
Departure_Delay_in_Mins    0.060395
Arrival_Delay_in_Mins      0.378262
Overall_Experience         0.000000
Seat_Comfort               0.064633
Seat_Class                 0.000000
Arrival_Time_Convenient    9.461851
Catering                   9.261594
Platform_Location          0.031787
Onboard_Wifi_Service       0.031787
Onboard_Entertainment      0.019072
Online_Support             0.096420
Ease_of_Online_Booking     0.077348
Onboard_Service            8.053698
Legroom                    0.095360
Baggage_Handling           0.150457
CheckIn_Service            0.081586
Cleanliness                0.006357
Online_Boarding            0.006357
dtype: float64

Observations:

  1. Customer_Type, Type_Travel, Arrival_Time_Convenient, Catering and Onboard_Service each have between 8% and 10% null values.
  2. All other columns have less than 1% null values and can be imputed with a typical value (the mean or median for numerical columns, the most frequent level for categorical ones).
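One way to carry out such an imputation is sketched below on a toy frame. Filling with the median and the mode is an assumption for illustration; the notebook may use a different strategy.

```python
import pandas as pd

# Toy frame with one numerical and one categorical column, each with a gap
toy = pd.DataFrame({
    "Age": [52.0, None, 43.0, 44.0],
    "Gender": ["Female", "Male", None, "Female"],
})

num_cols = toy.select_dtypes(include="number").columns
cat_cols = toy.select_dtypes(exclude="number").columns

# Numerical columns: fill with the median, which is robust to outliers
toy[num_cols] = toy[num_cols].fillna(toy[num_cols].median())

# Categorical columns: fill with the most frequent level
for col in cat_cols:
    toy[col] = toy[col].fillna(toy[col].mode().iloc[0])

print(toy)
```

When a model is evaluated on held-out data, the fill values are normally computed on the training data only and then reused for the test data.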
In [ ]:
#Retrieving the names of numerical columns and categorical columns
num_cols = train.select_dtypes(include='number').columns
cat_cols = train.select_dtypes(exclude='number').columns

#Creating the order sequence for countplots
satisfaction_scale = ['Excellent','Good','Acceptable','Needs Improvement','Poor','Extremely Poor']
location_scale = ['Very Convenient', 'Convenient','Manageable','Needs Improvement','Inconvenient','Very Inconvenient']
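A short sketch of how such an order list can be used: reindexing the level counts by the scale shows them from best to worst rather than by frequency. The `ratings` values are made up. For plots, `sns.countplot` accepts the same list through its `order` argument.

```python
import pandas as pd

satisfaction_scale = ['Excellent', 'Good', 'Acceptable', 'Needs Improvement', 'Poor', 'Extremely Poor']

# Made-up ratings standing in for one survey column
ratings = pd.Series(['Good', 'Poor', 'Excellent', 'Good', 'Acceptable'])

# Reindex the counts by the scale so the levels appear in a meaningful order;
# levels that never occur get a count of 0 instead of being dropped
counts = ratings.value_counts().reindex(satisfaction_scale, fill_value=0)
print(counts)
```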
In [ ]:
# Printing the number of observations in each level of every categorical column
for column in cat_cols:
    print(train[column].value_counts())
    print("-" * 50)
Female    47815
Male      46487
Name: Gender, dtype: int64
--------------------------------------------------
Loyal Customer       69823
Disloyal Customer    15605
Name: Customer_Type, dtype: int64
--------------------------------------------------
Business Travel    58617
Personal Travel    26536
Name: Type_Travel, dtype: int64
--------------------------------------------------
Eco         49342
Business    45037
Name: Travel_Class, dtype: int64
--------------------------------------------------
Acceptable           21158
Needs Improvement    20946
Good                 20595
Poor                 15185
Excellent            12971
Extremely Poor        3463
Name: Seat_Comfort, dtype: int64
--------------------------------------------------
Green Car    47435
Ordinary     46944
Name: Seat_Class, dtype: int64
--------------------------------------------------
Good                 19574
Excellent            17684
Acceptable           15177
Needs Improvement    14990
Poor                 13692
Extremely Poor        4332
Name: Arrival_Time_Convenient, dtype: int64
--------------------------------------------------
Acceptable           18468
Needs Improvement    17978
Good                 17969
Poor                 13858
Excellent            13455
Extremely Poor        3910
Name: Catering, dtype: int64
--------------------------------------------------
Manageable           24173
Convenient           21912
Needs Improvement    17832
Inconvenient         16449
Very Convenient      13981
Very Inconvenient        2
Name: Platform_Location, dtype: int64
--------------------------------------------------
Good                 22835
Excellent            20968
Acceptable           20118
Needs Improvement    19596
Poor                 10741
Extremely Poor          91
Name: Onboard_Wifi_Service, dtype: int64
--------------------------------------------------
Good                 30446
Excellent            21644
Acceptable           17560
Needs Improvement    13926
Poor                  8641
Extremely Poor        2144
Name: Onboard_Entertainment, dtype: int64
--------------------------------------------------
Good                 30016
Excellent            25894
Acceptable           15702
Needs Improvement    12508
Poor                 10167
Extremely Poor           1
Name: Online_Support, dtype: int64
--------------------------------------------------
Good                 28909
Excellent            24744
Acceptable           16390
Needs Improvement    14479
Poor                  9768
Extremely Poor          16
Name: Ease_of_Online_Booking, dtype: int64
--------------------------------------------------
Good                 27265
Excellent            21272
Acceptable           18071
Needs Improvement    11390
Poor                  8776
Extremely Poor           4
Name: Onboard_Service, dtype: int64
--------------------------------------------------
Good                 28870
Excellent            24832
Acceptable           16384
Needs Improvement    15753
Poor                  8110
Extremely Poor         340
Name: Legroom, dtype: int64
--------------------------------------------------
Good                 34944
Excellent            26003
Acceptable           17767
Needs Improvement     9759
Poor                  5764
Name: Baggage_Handling, dtype: int64
--------------------------------------------------
Good                 26502
Acceptable           25803
Excellent            19641
Needs Improvement    11218
Poor                 11137
Extremely Poor           1
Name: CheckIn_Service, dtype: int64
--------------------------------------------------
Good                 35427
Excellent            26053
Acceptable           17449
Needs Improvement     9806
Poor                  5633
Extremely Poor           5
Name: Cleanliness, dtype: int64
--------------------------------------------------
Good                 25533
Acceptable           22475
Excellent            21742
Needs Improvement    13451
Poor                 11160
Extremely Poor          12
Name: Online_Boarding, dtype: int64
--------------------------------------------------
In [ ]:
train.head(1)
Out[ ]:
ID Gender Customer_Type Age Type_Travel Travel_Class Travel_Distance Departure_Delay_in_Mins Arrival_Delay_in_Mins Overall_Experience Seat_Comfort Seat_Class Arrival_Time_Convenient Catering Platform_Location Onboard_Wifi_Service Onboard_Entertainment Online_Support Ease_of_Online_Booking Onboard_Service Legroom Baggage_Handling CheckIn_Service Cleanliness Online_Boarding
0 98800001 Female Loyal Customer 52.0 NaN Business 272 0.0 5.0 0 Needs Improvement Green Car Excellent Excellent Very Convenient Good Needs Improvement Acceptable Needs Improvement Needs Improvement Acceptable Needs Improvement Good Needs Improvement Poor
In [ ]:
train.tail(1)
Out[ ]:
ID Gender Customer_Type Age Type_Travel Travel_Class Travel_Distance Departure_Delay_in_Mins Arrival_Delay_in_Mins Overall_Experience Seat_Comfort Seat_Class Arrival_Time_Convenient Catering Platform_Location Onboard_Wifi_Service Onboard_Entertainment Online_Support Ease_of_Online_Booking Onboard_Service Legroom Baggage_Handling CheckIn_Service Cleanliness Online_Boarding
94378 98894379 Male Loyal Customer 54.0 NaN Eco 2107 28.0 28.0 0 Acceptable Ordinary Poor Acceptable Manageable Acceptable Acceptable Acceptable Acceptable Poor Good Good Poor Good Acceptable

Observations:

  1. The following variables are binary:
  • Gender - Male or Female
  • Customer_Type - Loyal Customer or Disloyal Customer
  • Type_Travel - Business Travel or Personal Travel
  • Travel_Class - Eco or Business
  • Seat_Class - Green Car or Ordinary
  2. Platform_Location is measured on the scale:
  • Very Convenient
  • Convenient
  • Manageable
  • Needs Improvement
  • Inconvenient
  • Very Inconvenient
  3. All the other survey variables are measured on the scale:
  • Excellent
  • Good
  • Acceptable
  • Needs Improvement
  • Poor
  • Extremely Poor
  4. Platform_Location and the other survey variables can be encoded as integers, from 0 for the most unsatisfactory rating to 5 for the most satisfactory.
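The encoding described in the last observation above can also be written as a dictionary lookup. The following is a minimal sketch on a toy frame, not the notebook's own implementation: the `demo` data and the names `SATISFACTION_CODES` and `LOCATION_CODES` are assumptions made for illustration, while the labels and the 0 to 5 codes come from the scales listed above.

```python
import pandas as pd

# Rating labels taken from the scales above; codes run from 0 (worst) to 5 (best).
SATISFACTION_CODES = {
    "Extremely Poor": 0, "Poor": 1, "Needs Improvement": 2,
    "Acceptable": 3, "Good": 4, "Excellent": 5,
}
LOCATION_CODES = {
    "Very Inconvenient": 0, "Inconvenient": 1, "Needs Improvement": 2,
    "Manageable": 3, "Convenient": 4, "Very Convenient": 5,
}

# Toy frame standing in for `train` (hypothetical values, for illustration only).
demo = pd.DataFrame({
    "Seat_Comfort": ["Good", "Poor", None, "Excellent"],
    "Platform_Location": ["Manageable", "Very Convenient", "Inconvenient", None],
})

encoded = demo.copy()
# Series.map returns NaN for anything not in the dictionary, so nulls stay null.
encoded["Seat_Comfort"] = demo["Seat_Comfort"].map(SATISFACTION_CODES)
encoded["Platform_Location"] = demo["Platform_Location"].map(LOCATION_CODES)
print(encoded)
```

One difference from a pass-through function is worth keeping in mind: with `map`, an unexpected label becomes NaN instead of being left as text, which makes typos in the source data easier to spot.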
In [ ]:
# Defining a function to encode the ratings from categorical to numerical
def cat_to_numerical(x):
    if x=="Excellent":
        return 5
    elif x=="Good":
        return 4
    elif x=="Acceptable":
        return 3
    elif x=="Needs Improvement":
        return 2
    elif x=="Poor":
        return 1
    elif x=="Extremely Poor":
        return 0
    else:
        return x
In [ ]:
appreciation_variables = ['Seat_Comfort',
       'Arrival_Time_Convenient', 'Catering',
       'Onboard_Wifi_Service', 'Onboard_Entertainment', 'Online_Support',
       'Ease_of_Online_Booking', 'Onboard_Service', 'Legroom',
       'Baggage_Handling', 'CheckIn_Service', 'Cleanliness',
       'Online_Boarding']

#Converting all features with satisfactory scales to numerical variables
for column in appreciation_variables:
    train[column] = train[column].apply(cat_to_numerical)

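#Converting Platform_Location to numerical variables
#The result is assigned back: an in-place replace on a selected column is not guaranteed to update `train` in newer pandas versions
train['Platform_Location'] = train['Platform_Location'].replace({'Very Convenient': 5,
                                                                 'Convenient': 4,
                                                                 'Manageable': 3,
                                                                 'Needs Improvement': 2,
                                                                 'Inconvenient': 1,
                                                                 'Very Inconvenient': 0})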
In [ ]:
train.head(1)
Out[ ]:
ID Gender Customer_Type Age Type_Travel Travel_Class Travel_Distance Departure_Delay_in_Mins Arrival_Delay_in_Mins Overall_Experience Seat_Comfort Seat_Class Arrival_Time_Convenient Catering Platform_Location Onboard_Wifi_Service Onboard_Entertainment Online_Support Ease_of_Online_Booking Onboard_Service Legroom Baggage_Handling CheckIn_Service Cleanliness Online_Boarding
0 98800001 Female Loyal Customer 52.0 NaN Business 272 0.0 5.0 0 2.0 Green Car 5.0 5.0 5.0 4.0 2.0 3.0 2.0 2.0 3.0 2.0 4.0 2.0 1.0
In [ ]:
train.tail(1)
Out[ ]:
ID Gender Customer_Type Age Type_Travel Travel_Class Travel_Distance Departure_Delay_in_Mins Arrival_Delay_in_Mins Overall_Experience Seat_Comfort Seat_Class Arrival_Time_Convenient Catering Platform_Location Onboard_Wifi_Service Onboard_Entertainment Online_Support Ease_of_Online_Booking Onboard_Service Legroom Baggage_Handling CheckIn_Service Cleanliness Online_Boarding
94378 98894379 Male Loyal Customer 54.0 NaN Eco 2107 28.0 28.0 0 3.0 Ordinary 1.0 3.0 3.0 3.0 3.0 3.0 3.0 1.0 4.0 4.0 1.0 4.0 3.0
In [ ]:
train.describe().T
Out[ ]:
count mean std min 25% 50% 75% max
ID 94379.0 9.884719e+07 27245.014865 98800001.0 98823595.5 98847190.0 98870784.5 98894379.0
Age 94346.0 3.941965e+01 15.116632 7.0 27.0 40.0 51.0 85.0
Travel_Distance 94379.0 1.978888e+03 1027.961019 50.0 1359.0 1923.0 2538.0 6951.0
Departure_Delay_in_Mins 94322.0 1.464709e+01 38.138781 0.0 0.0 0.0 12.0 1592.0
Arrival_Delay_in_Mins 94022.0 1.500522e+01 38.439409 0.0 0.0 0.0 13.0 1584.0
Overall_Experience 94379.0 5.466576e-01 0.497821 0.0 0.0 1.0 1.0 1.0
Seat_Comfort 94318.0 2.839182e+00 1.392526 0.0 2.0 3.0 4.0 5.0
Arrival_Time_Convenient 85449.0 2.994991e+00 1.526280 0.0 2.0 3.0 4.0 5.0
Catering 85638.0 2.853511e+00 1.443945 0.0 2.0 3.0 4.0 5.0
Platform_Location 94349.0 2.990864e+00 1.308233 0.0 2.0 3.0 4.0 5.0
Onboard_Wifi_Service 94349.0 3.248227e+00 1.319520 0.0 2.0 3.0 4.0 5.0
Onboard_Entertainment 94361.0 3.382510e+00 1.346190 0.0 2.0 4.0 4.0 5.0
Online_Support 94288.0 3.519250e+00 1.308174 0.0 3.0 4.0 5.0 5.0
Ease_of_Online_Booking 94306.0 3.470108e+00 1.305546 0.0 2.0 4.0 5.0 5.0
Onboard_Service 86778.0 3.470799e+00 1.268574 0.0 3.0 4.0 4.0 5.0
Legroom 94289.0 3.482994e+00 1.292260 0.0 2.0 4.0 5.0 5.0
Baggage_Handling 94237.0 3.696786e+00 1.156399 1.0 3.0 4.0 5.0 5.0
CheckIn_Service 94302.0 3.342400e+00 1.260307 0.0 3.0 3.0 4.0 5.0
Cleanliness 94373.0 3.704078e+00 1.151988 0.0 3.0 4.0 5.0 5.0
Online_Boarding 94373.0 3.351901e+00 1.298061 0.0 2.0 4.0 4.0 5.0

Exploratory Data Analysis¶

Univariate Analysis¶

Functions¶

In [ ]:
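def labeled_countplot(data,feature,perc = False, n = None, order = None):
  # order = True  -> histogram over the encoded (numeric) ratings
  # order = None  -> countplot sorted by descending frequency
  # order = False -> countplot in the default category order
  total = len(data[feature]) #Includes null rows, so percentages are shares of all rows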
  count = data[feature].nunique()

  #Changing size of the plot
  if n is None:
    plt.figure(figsize = (count + 1, 5)) #if n is not specified, then the size of the chart will be the according to number of features
  else:
    plt.figure(figsize = (n + 1, 5))

  #Rotate the x labels
  plt.xticks(rotation = 90)

  #Creating the order sequence for countplots
  satisfaction_scale = ['Excellent','Good','Acceptable','Needs Improvement','Poor','Extremely Poor']
  location_scale = ['Very Convenient', 'Convenient','Manageable','Needs Improvement','Inconvenient','Very Inconvenient']
  numerical_scale = [5,4,3,2,1,0]

  #Create the countplot and assigning it to object
  if order == True:
    ax = sns.histplot(data = data,
                      x = feature,
                      bins = 6,
                      binwidth = 1,
                      discrete = True)

  elif order is None:
    ax = sns.countplot(data = data,
                      x = feature,
                      palette = "Paired",
                      order = data.groupby([feature])['ID'].count().sort_values(ascending = False).index)

  else:
    ax = sns.countplot(data = data,
                      x = feature,
                      palette = "Paired")

  #Creating the labels
  for p in ax.patches:
    if perc == True:
      label = "{:.1f}%".format(100*p.get_height()/total) #Gets the percentage value of the height
    else:
      label = p.get_height() # Just get the height without percentage

    #Getting coordinates for the annotation
    x = p.get_x() + p.get_width()/2
    y = p.get_height()

    #Coding the annotations
    ax.annotate(label,(x,y),
                ha = "center",
                va = "center",
                size = 12,
                xytext = (0,5),
                textcoords = "offset points")

  plt.show()

  #Print out the number of null values
  missing = data[feature].isnull().sum()
  print('Number of null values: ',missing)
In [ ]:
# Function to plot a boxplot and a histogram along the same scale

def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None, whis = 1.5,outliers = True, mean = True, median = True):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
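    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    whis: whisker length as a multiple of the IQR (default 1.5)
    outliers: whether to draw outlier points on the boxplot (default True)
    mean: whether to mark the mean on the histogram (default True)
    median: whether to mark the median on the histogram (default True)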
    """
    # plt.subplots returns the figure and a tuple of two axes; unpack them so each subplot can be addressed by name
    f2, (ax_box2, ax_hist2) = plt.subplots(nrows = 2, # Number of rows of the subplot grid = 2
                              sharex = True,  # x-axis will be shared among all subplots
                              gridspec_kw = {"height_ratios": (0.25, 0.75)}, #This sets the 2 subplots' height ratios, with top one taking 25% of the total figure
                              figsize = figsize) # Creating the 2 subplots

    # Create Boxplot that shows mean
    sns.boxplot(data = data,
                x = feature,
                whis = whis,
                showfliers =  outliers,
                ax = ax_box2,
                showmeans = True,
                color = "violet")

    # Create Histogram; pass bins only when a value is given so that seaborn chooses the binning otherwise
    if bins:
        sns.histplot(data = data,
                     x = feature,
                     kde = kde,
                     ax = ax_hist2,
                     bins = bins,
                     palette = "winter")
    else:
        sns.histplot(data = data,
                     x = feature,
                     kde = kde,
                     ax = ax_hist2)
    # Add mean to the histogram
    if mean == True:
      ax_hist2.axvline(data[feature].mean(),
                      color = "green",
                      linestyle = "--")
    else:
      pass

    # Add median to the histogram
    if median == True:
      ax_hist2.axvline(data[feature].median(),
                      color = "black",
                      linestyle = "-")
    else:
      pass
In [ ]:
# Function to plot stacked bar plots

def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart

    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique() #This pulls out the number of unique values in the column
    sorter = data[target].value_counts().index[-1] #this tells you the least frequent value, as it sorts based on the frequency, and -1 means the least frequent

    #Create the stacked barplot for understanding
    tab1 = pd.crosstab(data[predictor],
                       data[target],
                       margins = True).sort_values(by = sorter, #Sort based on the least frequent value
                                                   ascending = False)
    print(tab1)
    print("-" * 120)

    #Create the stacked barplot for visualizing
    tab = pd.crosstab(data[predictor],
                      data[target],
                      normalize = "index").sort_values(by = sorter,
                                                       ascending = False)

    tab.plot(kind = "bar",
             stacked = True,
             figsize = (count + 1, 5))

    plt.legend(loc = "upper left", bbox_to_anchor = (1, 1)) #Place the legend outside the plot area
    plt.show()

Numerical Variables¶

Age¶

In [ ]:
histogram_boxplot(train, 'Age', figsize=(12, 7), kde=False, bins = 79)
In [ ]:
train['Age'].isnull().sum()
Out[ ]:
33

Observations:

  1. Two ages are especially common: 26 and 40 years old.
  2. There is a dip in the number of passengers around the age of 30.
  3. There are no outliers.
  4. There are 33 missing values (fewer than 0.1% of rows), which can safely be imputed with the mean age.
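The mean imputation suggested for Age can be sketched as follows. This is only an illustration under stated assumptions: the `age` series holds made-up values standing in for `train['Age']`, and whether the mean or the median is the better fill value is left open.

```python
import numpy as np
import pandas as pd

# Hypothetical ages standing in for train['Age']; NaN marks a missing value.
age = pd.Series([26.0, 40.0, np.nan, 51.0, 27.0, np.nan], name="Age")

mean_age = age.mean()                # NaN values are skipped by default
age_imputed = age.fillna(mean_age)   # returns a new Series; `age` is unchanged

print(f"mean used for imputation: {mean_age:.1f}")
print(f"missing before: {age.isnull().sum()}, after: {age_imputed.isnull().sum()}")
```

If the test set is imputed later, the fill value would normally be the one computed on the training data, so that both sets are treated consistently.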

Travel_Distance¶

In [ ]:
histogram_boxplot(train, 'Travel_Distance', figsize=(12, 7), kde=False, bins = None)
In [ ]:
#Understanding the outliers in the upper range of Travel_Distance
train.loc[train['Travel_Distance'] > 6900].sort_values(by = 'Travel_Distance', ascending = False)
Out[ ]:
ID Gender Customer_Type Age Type_Travel Travel_Class Travel_Distance Departure_Delay_in_Mins Arrival_Delay_in_Mins Overall_Experience Seat_Comfort Seat_Class Arrival_Time_Convenient Catering Platform_Location Onboard_Wifi_Service Onboard_Entertainment Online_Support Ease_of_Online_Booking Onboard_Service Legroom Baggage_Handling CheckIn_Service Cleanliness Online_Boarding
13169 98813170 Male NaN 42.0 Business Travel Business 6951 1.0 25.0 0 2.0 Ordinary 2.0 2.0 5.0 2.0 1.0 1.0 3.0 3.0 4.0 5.0 1.0 4.0 1.0
2248 98802249 Male Loyal Customer 46.0 Business Travel Business 6950 0.0 0.0 1 1.0 Ordinary 1.0 1.0 1.0 4.0 4.0 4.0 1.0 3.0 5.0 5.0 4.0 3.0 4.0
87232 98887233 Female Loyal Customer 45.0 Business Travel Business 6948 0.0 13.0 1 1.0 Green Car 1.0 1.0 1.0 2.0 1.0 1.0 3.0 3.0 5.0 4.0 1.0 3.0 1.0
67206 98867207 Female Loyal Customer 57.0 Personal Travel Eco 6924 12.0 17.0 0 3.0 Green Car 4.0 3.0 5.0 3.0 3.0 3.0 3.0 1.0 5.0 1.0 3.0 3.0 3.0
6839 98806840 Male Loyal Customer 44.0 Business Travel Business 6907 0.0 0.0 1 5.0 Ordinary 5.0 5.0 5.0 5.0 4.0 4.0 3.0 2.0 3.0 4.0 4.0 4.0 4.0
8478 98808479 Female Loyal Customer 29.0 Personal Travel Eco 6907 6.0 0.0 0 2.0 Green Car 4.0 2.0 5.0 2.0 4.0 4.0 2.0 2.0 4.0 4.0 4.0 3.0 4.0
In [ ]:
train['Travel_Distance'].isnull().sum()
Out[ ]:
0

Observations:

  1. The distribution is roughly bell-shaped between 1,000 and 3,000 km of travel distance.
  2. There is a second peak at around 400 km, indicating two clusters of travel distance: one centered around 400 km and the other around 1,800 km.
  3. Many trips lie above the upper whisker, but on close inspection they need not be treated as outliers, since trips of up to about 6,950 km appear plausible in this dataset.
  4. Of the 6 passengers who travelled more than 6,900 km, 5 are loyal customers and 1 has no recorded customer type.
  5. There are no missing values.
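To make "above the upper whisker" concrete, the limit can be computed from the quartiles reported by `train.describe()` earlier (25% = 1359, 75% = 2538) with the default whisker multiplier of 1.5. The variable names below are illustrative, and the commented-out line assumes the `train` DataFrame from this notebook.

```python
# Quartiles of Travel_Distance as reported by train.describe() above.
q1, q3 = 1359.0, 2538.0
whis = 1.5  # default whisker multiplier used by histogram_boxplot

iqr = q3 - q1                    # interquartile range
upper_fence = q3 + whis * iqr    # points above this are drawn as outliers

print(f"IQR = {iqr}, upper fence = {upper_fence}")

# With the real data, the number of trips beyond the fence would be:
# (train['Travel_Distance'] > upper_fence).sum()
```

The whisker itself is drawn at the largest observation that does not exceed this fence, so the fence is an upper bound on where the whisker ends.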

Departure Delay in Mins¶

In [ ]:
histogram_boxplot(train, 'Departure_Delay_in_Mins', figsize=(12, 7), kde=False, bins = None, median = False)
In [ ]:
#Studying the outliers in Departure_delay
dep_delay = train.loc[train['Departure_Delay_in_Mins'] > 600].sort_values(by = 'Departure_Delay_in_Mins', ascending = False)
dep_delay
Out[ ]:
ID Gender Customer_Type Age Type_Travel Travel_Class Travel_Distance Departure_Delay_in_Mins Arrival_Delay_in_Mins Overall_Experience Seat_Comfort Seat_Class Arrival_Time_Convenient Catering Platform_Location Onboard_Wifi_Service Onboard_Entertainment Online_Support Ease_of_Online_Booking Onboard_Service Legroom Baggage_Handling CheckIn_Service Cleanliness Online_Boarding
12755 98812756 Female Loyal Customer 47.0 NaN Eco 3113 1592.0 1584.0 0 2.0 Ordinary 2.0 2.0 3.0 2.0 4.0 NaN NaN 4.0 4.0 NaN NaN 3.0 2.0
40837 98840838 Male Loyal Customer 32.0 NaN Business 4425 1305.0 1280.0 1 2.0 Green Car 2.0 2.0 2.0 5.0 5.0 5.0 3.0 4.0 5.0 4.0 5.0 3.0 5.0
22845 98822846 NaN NaN 8.0 Personal Travel Eco 3017 1128.0 1115.0 0 2.0 Ordinary 5.0 2.0 2.0 2.0 1.0 1.0 4.0 4.0 3.0 5.0 1.0 3.0 1.0
62544 98862545 Male Loyal Customer 49.0 Business Travel Business 3792 1017.0 1011.0 1 1.0 Green Car 1.0 1.0 1.0 4.0 4.0 4.0 4.0 4.0 4.0 5.0 4.0 4.0 4.0
44159 98844160 Female Loyal Customer 39.0 Business Travel Business 3549 951.0 940.0 0 1.0 Ordinary 4.0 4.0 4.0 1.0 1.0 1.0 2.0 1.0 3.0 3.0 1.0 1.0 1.0
30723 98830724 Male Loyal Customer 47.0 Business Travel Business 3835 933.0 920.0 1 4.0 Ordinary 5.0 4.0 4.0 4.0 4.0 4.0 5.0 4.0 5.0 5.0 4.0 4.0 4.0
26855 98826856 Female Loyal Customer 53.0 Business Travel Business 4198 930.0 952.0 0 3.0 Ordinary 4.0 4.0 4.0 3.0 3.0 3.0 3.0 5.0 2.0 2.0 3.0 4.0 3.0
55955 98855956 Male NaN 27.0 Business Travel Business 3623 859.0 860.0 1 1.0 Ordinary 1.0 4.0 1.0 4.0 5.0 5.0 4.0 3.0 5.0 4.0 5.0 4.0 5.0
37662 98837663 Male NaN 15.0 Business Travel Business 5865 853.0 823.0 0 2.0 Green Car 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 4.0 3.0 2.0 3.0 2.0
27510 98827511 Female Loyal Customer 42.0 Business Travel Business 2668 815.0 822.0 0 2.0 Green Car NaN 1.0 1.0 2.0 2.0 2.0 2.0 5.0 3.0 3.0 2.0 4.0 2.0
81053 98881054 Female Loyal Customer 45.0 Personal Travel Eco 2460 794.0 795.0 0 1.0 Green Car 4.0 1.0 1.0 1.0 5.0 5.0 3.0 4.0 4.0 5.0 5.0 5.0 5.0
75910 98875911 Male NaN 23.0 Business Travel Eco 2563 750.0 729.0 1 5.0 Ordinary 2.0 2.0 2.0 5.0 5.0 5.0 1.0 1.0 2.0 4.0 5.0 1.0 5.0
65872 98865873 Female Loyal Customer 30.0 Business Travel Business 4124 748.0 720.0 0 2.0 Ordinary 1.0 NaN 1.0 2.0 2.0 2.0 4.0 2.0 4.0 4.0 2.0 1.0 2.0
6952 98806953 Male Loyal Customer 33.0 Personal Travel Eco 2832 726.0 691.0 0 1.0 Ordinary 5.0 0.0 3.0 0.0 1.0 1.0 4.0 2.0 5.0 5.0 1.0 4.0 1.0
63313 98863314 Female Loyal Customer 7.0 Personal Travel Eco 1990 724.0 705.0 0 1.0 Ordinary 5.0 0.0 1.0 0.0 4.0 4.0 4.0 4.0 5.0 1.0 4.0 2.0 4.0
93337 98893338 Female Loyal Customer 42.0 Personal Travel Business 2256 692.0 702.0 0 2.0 Green Car 3.0 2.0 2.0 2.0 3.0 3.0 1.0 3.0 3.0 2.0 3.0 3.0 3.0
40077 98840078 Female NaN 39.0 Business Travel Eco 2431 652.0 638.0 0 3.0 Green Car 3.0 3.0 3.0 3.0 4.0 4.0 5.0 3.0 1.0 5.0 4.0 1.0 4.0
91267 98891268 Male Loyal Customer 48.0 NaN Eco 4318 626.0 604.0 0 1.0 Ordinary 4.0 1.0 4.0 1.0 1.0 1.0 3.0 3.0 2.0 2.0 1.0 2.0 1.0
7650 98807651 Female Loyal Customer 43.0 Business Travel Business 3882 624.0 615.0 1 3.0 Green Car 3.0 3.0 3.0 4.0 4.0 4.0 3.0 4.0 5.0 4.0 4.0 3.0 4.0
94324 98894325 Male NaN 35.0 Business Travel Business 2592 610.0 593.0 1 5.0 Green Car 5.0 NaN 5.0 3.0 3.0 3.0 2.0 2.0 3.0 3.0 3.0 1.0 3.0
In [ ]:
len(dep_delay)
Out[ ]:
20
In [ ]:
train['Departure_Delay_in_Mins'].isnull().sum()
Out[ ]:
57

Observations:

  1. Most delays are very short: the median is 0 minutes and 75% of departures are delayed by 12 minutes or less.
  2. However, a few delays are extremely long, up to 1,592 minutes (about 26.5 hours), and can be considered outliers.
  3. Only 20 trips had a departure delay of more than 10 hours (600 minutes). Since such delays are rare, this could serve as a cutoff for declaring outliers.
  4. There are 57 missing values.
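If the 10-hour threshold were adopted, the extreme delays could be flagged or capped as in the sketch below. The `delay` values are hypothetical stand-ins for `train['Departure_Delay_in_Mins']`, and capping is only one possible treatment; dropping or keeping the rows are equally valid choices that the notebook has not settled on.

```python
import numpy as np
import pandas as pd

CUTOFF_MINS = 600  # 10 hours, the threshold explored above

# Hypothetical delays standing in for train['Departure_Delay_in_Mins'].
delay = pd.Series([0.0, 12.0, 1592.0, np.nan, 610.0, 45.0])

is_extreme = delay > CUTOFF_MINS          # NaN compares as False
capped = delay.clip(upper=CUTOFF_MINS)    # NaN is left untouched

print(f"rows above cutoff: {is_extreme.sum()}")
print(capped.tolist())
```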

Arrival Delay in Mins¶

In [ ]:
histogram_boxplot(train, 'Arrival_Delay_in_Mins', figsize=(12, 7), kde=False, bins = None, median = False)
In [ ]:
arr_delay = train.loc[train['Arrival_Delay_in_Mins'] > 600].sort_values(by = 'Arrival_Delay_in_Mins', ascending = False)
arr_delay
Out[ ]:
ID Gender Customer_Type Age Type_Travel Travel_Class Travel_Distance Departure_Delay_in_Mins Arrival_Delay_in_Mins Overall_Experience Seat_Comfort Seat_Class Arrival_Time_Convenient Catering Platform_Location Onboard_Wifi_Service Onboard_Entertainment Online_Support Ease_of_Online_Booking Onboard_Service Legroom Baggage_Handling CheckIn_Service Cleanliness Online_Boarding
12755 98812756 Female Loyal Customer 47.0 NaN Eco 3113 1592.0 1584.0 0 2.0 Ordinary 2.0 2.0 3.0 2.0 4.0 NaN NaN 4.0 4.0 NaN NaN 3.0 2.0
40837 98840838 Male Loyal Customer 32.0 NaN Business 4425 1305.0 1280.0 1 2.0 Green Car 2.0 2.0 2.0 5.0 5.0 5.0 3.0 4.0 5.0 4.0 5.0 3.0 5.0
22845 98822846 NaN NaN 8.0 Personal Travel Eco 3017 1128.0 1115.0 0 2.0 Ordinary 5.0 2.0 2.0 2.0 1.0 1.0 4.0 4.0 3.0 5.0 1.0 3.0 1.0
62544 98862545 Male Loyal Customer 49.0 Business Travel Business 3792 1017.0 1011.0 1 1.0 Green Car 1.0 1.0 1.0 4.0 4.0 4.0 4.0 4.0 4.0 5.0 4.0 4.0 4.0
26855 98826856 Female Loyal Customer 53.0 Business Travel Business 4198 930.0 952.0 0 3.0 Ordinary 4.0 4.0 4.0 3.0 3.0 3.0 3.0 5.0 2.0 2.0 3.0 4.0 3.0
44159 98844160 Female Loyal Customer 39.0 Business Travel Business 3549 951.0 940.0 0 1.0 Ordinary 4.0 4.0 4.0 1.0 1.0 1.0 2.0 1.0 3.0 3.0 1.0 1.0 1.0
30723 98830724 Male Loyal Customer 47.0 Business Travel Business 3835 933.0 920.0 1 4.0 Ordinary 5.0 4.0 4.0 4.0 4.0 4.0 5.0 4.0 5.0 5.0 4.0 4.0 4.0
55955 98855956 Male NaN 27.0 Business Travel Business 3623 859.0 860.0 1 1.0 Ordinary 1.0 4.0 1.0 4.0 5.0 5.0 4.0 3.0 5.0 4.0 5.0 4.0 5.0
37662 98837663 Male NaN 15.0 Business Travel Business 5865 853.0 823.0 0 2.0 Green Car 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 4.0 3.0 2.0 3.0 2.0
27510 98827511 Female Loyal Customer 42.0 Business Travel Business 2668 815.0 822.0 0 2.0 Green Car NaN 1.0 1.0 2.0 2.0 2.0 2.0 5.0 3.0 3.0 2.0 4.0 2.0
81053 98881054 Female Loyal Customer 45.0 Personal Travel Eco 2460 794.0 795.0 0 1.0 Green Car 4.0 1.0 1.0 1.0 5.0 5.0 3.0 4.0 4.0 5.0 5.0 5.0 5.0
75910 98875911 Male NaN 23.0 Business Travel Eco 2563 750.0 729.0 1 5.0 Ordinary 2.0 2.0 2.0 5.0 5.0 5.0 1.0 1.0 2.0 4.0 5.0 1.0 5.0
65872 98865873 Female Loyal Customer 30.0 Business Travel Business 4124 748.0 720.0 0 2.0 Ordinary 1.0 NaN 1.0 2.0 2.0 2.0 4.0 2.0 4.0 4.0 2.0 1.0 2.0
63313 98863314 Female Loyal Customer 7.0 Personal Travel Eco 1990 724.0 705.0 0 1.0 Ordinary 5.0 0.0 1.0 0.0 4.0 4.0 4.0 4.0 5.0 1.0 4.0 2.0 4.0
93337 98893338 Female Loyal Customer 42.0 Personal Travel Business 2256 692.0 702.0 0 2.0 Green Car 3.0 2.0 2.0 2.0 3.0 3.0 1.0 3.0 3.0 2.0 3.0 3.0 3.0
6952 98806953 Male Loyal Customer 33.0 Personal Travel Eco 2832 726.0 691.0 0 1.0 Ordinary 5.0 0.0 3.0 0.0 1.0 1.0 4.0 2.0 5.0 5.0 1.0 4.0 1.0
40077 98840078 Female NaN 39.0 Business Travel Eco 2431 652.0 638.0 0 3.0 Green Car 3.0 3.0 3.0 3.0 4.0 4.0 5.0 3.0 1.0 5.0 4.0 1.0 4.0
7650 98807651 Female Loyal Customer 43.0 Business Travel Business 3882 624.0 615.0 1 3.0 Green Car 3.0 3.0 3.0 4.0 4.0 4.0 3.0 4.0 5.0 4.0 4.0 3.0 4.0
91267 98891268 Male Loyal Customer 48.0 NaN Eco 4318 626.0 604.0 0 1.0 Ordinary 4.0 1.0 4.0 1.0 1.0 1.0 3.0 3.0 2.0 2.0 1.0 2.0 1.0
In [ ]:
len(arr_delay)
Out[ ]:
19
In [ ]:
train['Arrival_Delay_in_Mins'].isnull().sum()
Out[ ]:
357

Observations:

  1. The distribution is very similar to that of departure delays: most arrival delays are short, with 75% at 13 minutes or less.
  2. There are 19 trips with an arrival delay of more than 10 hours.
  3. There are 357 missing values, considerably more than the 57 missing departure delays.
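Because arrival delays have more gaps than departure delays, it can help to check how many rows with a missing arrival delay still have a departure delay recorded. The sketch below uses a small made-up frame in place of `train`; the name `n_with_departure` is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical rows standing in for `train`; NaN marks a missing delay.
demo = pd.DataFrame({
    "Departure_Delay_in_Mins": [0.0, 15.0, np.nan, 120.0, 3.0],
    "Arrival_Delay_in_Mins":   [0.0, np.nan, np.nan, 118.0, np.nan],
})

# Missing values per delay column.
print(demo.isnull().sum())

# Rows missing the arrival delay that still have a departure delay recorded.
arr_missing = demo["Arrival_Delay_in_Mins"].isnull()
n_with_departure = demo.loc[arr_missing, "Departure_Delay_in_Mins"].notnull().sum()
print(f"arrival missing, departure present: {n_with_departure}")
```

A high count on the real data would suggest that the departure delay is available as context when deciding how to impute those gaps.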

Categorical Variables¶

Gender¶

In [ ]:
sns.countplot(data=train,x='Gender')
Out[ ]:
<Axes: xlabel='Gender', ylabel='count'>
In [ ]:
train['Gender'].isnull().sum()
Out[ ]:
77

Observations:

  1. Women and men are almost equally represented: of the passengers with a recorded gender, 50.7% are female and 49.3% are male.
  2. Gender is known for 99.9% of the passengers; only 77 values are missing for this feature.
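The shares can be re-derived directly from the counts printed earlier (47,815 female, 46,487 male) and the 77 nulls. The arithmetic below uses only those three numbers; the variable names are illustrative.

```python
# Counts taken from the value_counts() output and the null check above.
female, male, missing = 47815, 46487, 77

known = female + male       # passengers with a recorded gender
total = known + missing     # all passengers in the training data

print(f"female share of known: {100 * female / known:.1f}%")
print(f"male share of known:   {100 * male / known:.1f}%")
print(f"gender recorded for:   {100 * known / total:.2f}% of passengers")
```

On the DataFrame itself, `train['Gender'].value_counts(normalize=True)` gives the same shares of known values in one call.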

Travel Class¶

In [ ]:
labeled_countplot(train,'Travel_Class', perc = True,order = False)
Number of null values:  0

Observations:

  • There are more Eco than Business passengers (49,342 vs 45,037), as one would expect.
  • No null values

Platform Location¶

In [ ]:
labeled_countplot(train,'Platform_Location', perc = True, order = True)
Number of null values:  30
In [ ]:
#During the earlier phases, we caught that there are 2 passengers who gave a 'very inconvenient' rating
train['Platform_Location'].value_counts()
Out[ ]:
3.0    24173
4.0    21912
2.0    17832
1.0    16449
5.0    13981
0.0        2
Name: Platform_Location, dtype: int64
In [ ]:
#understanding who are the 2 passengers who rated very inconvenient
train.loc[train['Platform_Location'] == 0]
Out[ ]:
ID Gender Customer_Type Age Type_Travel Travel_Class Travel_Distance Departure_Delay_in_Mins Arrival_Delay_in_Mins Overall_Experience Seat_Comfort Seat_Class Arrival_Time_Convenient Catering Platform_Location Onboard_Wifi_Service Onboard_Entertainment Online_Support Ease_of_Online_Booking Onboard_Service Legroom Baggage_Handling CheckIn_Service Cleanliness Online_Boarding
49219 98849220 Female Loyal Customer 40.0 Personal Travel Eco 1968 0.0 0.0 1 1.0 Ordinary 0.0 5.0 0.0 4.0 4.0 4.0 1.0 1.0 1.0 1.0 3.0 1.0 4.0
79337 98879338 Female Loyal Customer 55.0 Business Travel Business 2063 0.0 0.0 1 1.0 Green Car 0.0 5.0 0.0 4.0 4.0 4.0 1.0 1.0 1.0 1.0 3.0 1.0 4.0
In [ ]:
#Taking a look at the missing values
missing_pl = train.loc[train['Platform_Location'].isnull()]
missing_pl
Out[ ]:
ID Gender Customer_Type Age Type_Travel Travel_Class Travel_Distance Departure_Delay_in_Mins Arrival_Delay_in_Mins Overall_Experience Seat_Comfort Seat_Class Arrival_Time_Convenient Catering Platform_Location Onboard_Wifi_Service Onboard_Entertainment Online_Support Ease_of_Online_Booking Onboard_Service Legroom Baggage_Handling CheckIn_Service Cleanliness Online_Boarding
2215 98802216 Male Loyal Customer 16.0 Personal Travel Eco 2977 0.0 0.0 0 2.0 Green Car NaN NaN NaN NaN NaN NaN 4.0 5.0 1.0 5.0 1.0 3.0 4.0
4477 98804478 Female Loyal Customer 69.0 Personal Travel Eco 881 4.0 18.0 0 3.0 Green Car NaN NaN NaN NaN NaN NaN 2.0 2.0 3.0 3.0 3.0 2.0 5.0
6694 98806695 Female Loyal Customer 35.0 Business Travel Business 2748 0.0 0.0 0 3.0 Ordinary 3.0 3.0 NaN NaN 4.0 4.0 3.0 3.0 3.0 3.0 1.0 3.0 4.0
8775 98808776 Female Loyal Customer 25.0 Business Travel Business 2815 0.0 0.0 1 5.0 Ordinary NaN NaN NaN NaN NaN NaN 4.0 NaN 2.0 5.0 5.0 4.0 4.0
14620 98814621 Male Loyal Customer 58.0 NaN Business 1575 0.0 0.0 1 2.0 Ordinary 2.0 2.0 NaN NaN 5.0 4.0 2.0 2.0 2.0 2.0 4.0 2.0 4.0
15271 98815272 Female Loyal Customer 13.0 NaN Eco 3151 0.0 0.0 1 5.0 Ordinary 5.0 5.0 NaN NaN 5.0 4.0 4.0 2.0 5.0 5.0 4.0 4.0 4.0
17607 98817608 Male Disloyal Customer 37.0 Business Travel Business 2007 4.0 16.0 0 1.0 Ordinary NaN NaN NaN NaN NaN NaN 5.0 5.0 3.0 5.0 5.0 4.0 5.0
22867 98822868 Female Loyal Customer 47.0 Personal Travel Eco 406 0.0 0.0 1 3.0 Ordinary NaN NaN NaN NaN NaN NaN 4.0 3.0 3.0 4.0 4.0 5.0 4.0
29893 98829894 Male Loyal Customer 47.0 Business Travel Business 3095 0.0 0.0 1 4.0 Green Car 1.0 4.0 NaN NaN 5.0 5.0 4.0 4.0 4.0 4.0 5.0 4.0 4.0
32086 98832087 Male Loyal Customer 39.0 Business Travel Business 302 87.0 90.0 1 4.0 Ordinary 4.0 4.0 NaN NaN 5.0 4.0 4.0 4.0 4.0 NaN NaN 4.0 4.0
35373 98835374 Male Loyal Customer 47.0 Business Travel Business 738 13.0 1.0 1 3.0 Green Car NaN NaN NaN NaN NaN NaN 5.0 5.0 5.0 5.0 5.0 5.0 3.0
37897 98837898 Female Disloyal Customer 39.0 Business Travel Business 1530 0.0 0.0 0 2.0 Green Car NaN 2.0 NaN NaN 2.0 3.0 5.0 2.0 1.0 3.0 3.0 5.0 5.0
41688 98841689 Male Loyal Customer 52.0 Business Travel Business 3535 0.0 0.0 0 3.0 Green Car NaN NaN NaN NaN NaN NaN 3.0 3.0 3.0 3.0 2.0 3.0 2.0
45382 98845383 Male NaN 27.0 Business Travel Business 3096 0.0 1.0 1 1.0 Ordinary 4.0 1.0 NaN NaN 2.0 2.0 2.0 4.0 3.0 4.0 5.0 5.0 2.0
47006 98847007 Female Disloyal Customer 25.0 Business Travel Eco 1098 0.0 0.0 0 1.0 Ordinary NaN NaN NaN NaN NaN NaN 3.0 5.0 3.0 NaN NaN 2.0 3.0
53010 98853011 Female Loyal Customer 67.0 Personal Travel Eco 1140 50.0 46.0 1 4.0 Ordinary NaN NaN NaN NaN NaN NaN 2.0 2.0 4.0 2.0 4.0 2.0 4.0
53500 98853501 Male Loyal Customer 47.0 Business Travel Business 1990 0.0 3.0 1 5.0 Green Car NaN NaN NaN NaN NaN NaN 3.0 3.0 3.0 3.0 2.0 3.0 5.0
56497 98856498 Female Loyal Customer 43.0 Business Travel Eco 701 0.0 0.0 1 4.0 Green Car 3.0 3.0 NaN NaN 4.0 1.0 4.0 4.0 4.0 4.0 2.0 4.0 3.0
59270 98859271 Male Loyal Customer 22.0 Personal Travel Eco 2220 39.0 41.0 0 3.0 Ordinary NaN NaN NaN NaN NaN NaN 1.0 5.0 2.0 5.0 5.0 5.0 1.0
59964 98859965 Male Disloyal Customer 23.0 Business Travel Eco 2013 0.0 0.0 0 2.0 Green Car 2.0 2.0 NaN NaN 2.0 2.0 4.0 3.0 5.0 5.0 5.0 5.0 4.0
62201 98862202 Female NaN 26.0 Business Travel Eco 2074 2.0 0.0 1 0.0 Ordinary NaN NaN NaN NaN NaN NaN 3.0 5.0 2.0 4.0 4.0 3.0 3.0
62699 98862700 Male Loyal Customer 49.0 Business Travel Business 960 0.0 0.0 1 3.0 Green Car 3.0 4.0 NaN NaN 4.0 5.0 2.0 2.0 2.0 2.0 5.0 2.0 3.0
66959 98866960 Male Disloyal Customer 22.0 Business Travel Eco 1418 14.0 23.0 0 4.0 Ordinary 4.0 4.0 NaN NaN 4.0 3.0 3.0 1.0 3.0 3.0 2.0 3.0 3.0
69915 98869916 Male Loyal Customer 46.0 Business Travel Business 366 8.0 0.0 1 4.0 Ordinary NaN NaN NaN NaN NaN NaN 5.0 5.0 5.0 5.0 3.0 5.0 5.0
70266 98870267 Female Loyal Customer 66.0 Business Travel Business 3207 0.0 0.0 1 0.0 Green Car NaN NaN NaN NaN NaN NaN 3.0 3.0 3.0 3.0 5.0 3.0 5.0
76005 98876006 Male Loyal Customer 68.0 Personal Travel Eco 2246 3.0 0.0 0 2.0 Green Car NaN NaN NaN NaN NaN NaN 2.0 4.0 4.0 2.0 1.0 1.0 2.0
84335 98884336 Male Disloyal Customer 36.0 Business Travel Eco 1595 8.0 5.0 0 2.0 Green Car 2.0 1.0 NaN NaN 1.0 5.0 5.0 4.0 5.0 NaN NaN 4.0 5.0
85184 98885185 Female Loyal Customer 48.0 Personal Travel Business 2926 0.0 0.0 1 5.0 Green Car NaN NaN NaN NaN NaN NaN 4.0 4.0 4.0 4.0 2.0 4.0 1.0
89394 98889395 Female Loyal Customer 49.0 Personal Travel Eco 210 0.0 0.0 1 4.0 Ordinary NaN NaN NaN NaN NaN NaN 1.0 4.0 3.0 2.0 4.0 3.0 1.0
91055 98891056 Male Loyal Customer 42.0 Business Travel Eco 2272 0.0 0.0 0 2.0 Green Car NaN NaN NaN NaN NaN NaN 2.0 4.0 2.0 3.0 1.0 3.0 2.0

Observations:

  1. People are generally satisfied with the platform location: the majority rate it Manageable or better.
  2. Only 2 people found the platform location Very Inconvenient, yet both were satisfied with the overall experience.
  3. Whenever the platform location value is missing, the onboard wifi service value is missing too.
  4. One possible explanation is that these stations are in less developed regions with poor connectivity, which would make the station harder to locate and access. The data alone cannot confirm this.
  5. As only 30 platform location values are missing, it is safe to replace them with the average value.
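The claim that the two columns go missing together can be checked with boolean masks instead of by reading the table. This is a sketch on a made-up frame standing in for `train`; the toy data deliberately includes one row where only the wifi rating is missing, to show that the check is directional.

```python
import numpy as np
import pandas as pd

# Hypothetical rows standing in for `train`; NaN marks a skipped survey answer.
demo = pd.DataFrame({
    "Platform_Location":    [3.0, np.nan, 4.0, np.nan, 2.0],
    "Onboard_Wifi_Service": [4.0, np.nan, 5.0, np.nan, np.nan],
})

pl_missing = demo["Platform_Location"].isnull()
wifi_missing = demo["Onboard_Wifi_Service"].isnull()

# Cross-tabulate the two missingness flags.
print(pd.crosstab(pl_missing, wifi_missing))

# True when every row missing Platform_Location is also missing the wifi rating.
all_co_missing = bool(wifi_missing[pl_missing].all())
print(all_co_missing)
```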

Seat Class¶

In [ ]:
labeled_countplot(train,'Seat_Class', perc = False,order = False)
Number of null values:  0

Observations:

  1. There are slightly more Green Car seats than Ordinary seats.
  2. There are no missing values for seat class.
  3. Since seat class is closely related to Legroom and Seat_Comfort, we can check whether they have similar distributions and, if so, combine the features.
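
One way to compare these distributions is a row-normalised cross-tabulation per seat class. The sketch below uses a hypothetical `toy` frame with made-up ratings, not the actual `train` data, and only illustrates the comparison, not the combining step.

```python
import pandas as pd

# Hypothetical toy frame standing in for `train`
toy = pd.DataFrame({
    "Seat_Class":   ["Green Car", "Green Car", "Ordinary", "Ordinary", "Green Car", "Ordinary"],
    "Seat_Comfort": [4, 5, 2, 3, 4, 2],
    "Legroom":      [5, 4, 3, 3, 4, 2],
})

# Share of each rating within each seat class (every row sums to 1)
comfort_by_class = pd.crosstab(toy["Seat_Class"], toy["Seat_Comfort"], normalize="index")
legroom_by_class = pd.crosstab(toy["Seat_Class"], toy["Legroom"], normalize="index")

print(comfort_by_class)
print(legroom_by_class)
```

If the two tables look alike across seat classes, that would support treating the features as closely related.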

Seat Comfort¶

In [ ]:
labeled_countplot(train,'Seat_Comfort', perc = True, order = True)
Number of null values:  61
In [ ]:
train.loc[train['Seat_Comfort'].isnull()]
Out[ ]:
ID Gender Customer_Type Age Type_Travel Travel_Class Travel_Distance Departure_Delay_in_Mins Arrival_Delay_in_Mins Overall_Experience Seat_Comfort Seat_Class Arrival_Time_Convenient Catering Platform_Location Onboard_Wifi_Service Onboard_Entertainment Online_Support Ease_of_Online_Booking Onboard_Service Legroom Baggage_Handling CheckIn_Service Cleanliness Online_Boarding
560 98800561 Male Loyal Customer 30.0 Business Travel Eco 1495 0.0 0.0 0 NaN Green Car NaN 4.0 2.0 4.0 4.0 4.0 4.0 2.0 4.0 3.0 1.0 3.0 4.0
2214 98802215 Male Loyal Customer 30.0 NaN Business 4725 86.0 77.0 1 NaN Green Car NaN 1.0 3.0 4.0 4.0 4.0 4.0 NaN NaN NaN 4.0 5.0 4.0
4754 98804755 Male Loyal Customer 63.0 Personal Travel Eco 1380 0.0 0.0 0 NaN Ordinary NaN 2.0 3.0 5.0 2.0 1.0 5.0 NaN NaN NaN 5.0 4.0 5.0
5191 98805192 Male NaN 33.0 Business Travel Business 1436 6.0 25.0 1 NaN Green Car NaN 3.0 3.0 4.0 4.0 4.0 4.0 3.0 5.0 4.0 4.0 3.0 4.0
8809 98808810 Female Loyal Customer 12.0 Personal Travel Eco 3114 41.0 44.0 1 NaN Green Car NaN 3.0 3.0 4.0 4.0 4.0 2.0 2.0 2.0 2.0 3.0 2.0 4.0
8851 98808852 Female Disloyal Customer 24.0 Business Travel Business 2910 0.0 0.0 1 NaN Ordinary NaN 0.0 5.0 3.0 0.0 3.0 3.0 3.0 4.0 4.0 4.0 5.0 3.0
8936 98808937 Female Loyal Customer 56.0 Business Travel Eco 273 0.0 0.0 1 NaN Ordinary NaN 2.0 2.0 1.0 1.0 3.0 3.0 NaN NaN NaN 4.0 3.0 4.0
9472 98809473 Female Loyal Customer 36.0 Business Travel Business 1453 0.0 0.0 1 NaN Ordinary NaN 1.0 1.0 5.0 4.0 5.0 3.0 3.0 3.0 3.0 5.0 3.0 3.0
10830 98810831 Female Loyal Customer 34.0 Personal Travel Eco 1336 16.0 40.0 0 NaN Green Car NaN 3.0 4.0 1.0 3.0 1.0 1.0 4.0 5.0 5.0 5.0 4.0 1.0
12651 98812652 Female Disloyal Customer 44.0 Business Travel Eco 1920 49.0 44.0 0 NaN Green Car NaN 3.0 3.0 5.0 2.0 5.0 5.0 1.0 2.0 4.0 4.0 4.0 5.0
17855 98817856 Female Loyal Customer 56.0 Personal Travel Eco 1207 0.0 0.0 1 NaN Green Car NaN 5.0 5.0 4.0 5.0 4.0 5.0 5.0 5.0 5.0 3.0 5.0 3.0
18241 98818242 Female Loyal Customer 15.0 Personal Travel Business 2767 1.0 16.0 0 NaN Green Car NaN 3.0 4.0 1.0 3.0 3.0 1.0 NaN 2.0 3.0 4.0 4.0 1.0
19627 98819628 Male NaN 53.0 Business Travel Eco 1676 0.0 0.0 1 NaN Green Car NaN 4.0 4.0 5.0 5.0 5.0 5.0 1.0 1.0 2.0 1.0 4.0 5.0
19672 98819673 Male Loyal Customer 28.0 Personal Travel Business 3849 0.0 0.0 0 NaN Green Car NaN 3.0 5.0 3.0 3.0 3.0 3.0 5.0 2.0 3.0 1.0 1.0 3.0
20501 98820502 Female Loyal Customer 60.0 Personal Travel Eco 397 0.0 0.0 0 NaN Ordinary NaN 3.0 5.0 3.0 4.0 4.0 2.0 2.0 3.0 2.0 4.0 2.0 5.0
20661 98820662 Female Disloyal Customer 22.0 Business Travel Eco 2211 0.0 0.0 0 NaN Green Car NaN 3.0 3.0 4.0 3.0 4.0 4.0 3.0 4.0 4.0 4.0 5.0 4.0
20788 98820789 Male Disloyal Customer 43.0 Business Travel Eco 2391 0.0 0.0 0 NaN Green Car NaN 3.0 3.0 5.0 3.0 5.0 5.0 1.0 4.0 5.0 5.0 3.0 5.0
23098 98823099 Female Loyal Customer 52.0 Business Travel Business 2146 0.0 0.0 0 NaN Ordinary NaN 5.0 5.0 5.0 4.0 3.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
28858 98828859 Male Disloyal Customer 26.0 Business Travel Business 1112 0.0 0.0 0 NaN Green Car NaN 0.0 3.0 4.0 0.0 4.0 4.0 NaN NaN NaN 3.0 4.0 4.0
29912 98829913 Male Loyal Customer 45.0 Business Travel Business 2060 0.0 0.0 1 NaN Ordinary NaN 5.0 5.0 1.0 3.0 5.0 4.0 4.0 4.0 4.0 4.0 4.0 1.0
30286 98830287 Male Disloyal Customer 49.0 NaN Business 1387 0.0 1.0 0 NaN Ordinary NaN NaN 5.0 4.0 1.0 4.0 4.0 5.0 3.0 5.0 4.0 4.0 4.0
30359 98830360 Female NaN 40.0 Business Travel Business 202 38.0 33.0 1 NaN Green Car NaN NaN 2.0 4.0 4.0 4.0 5.0 5.0 4.0 4.0 5.0 5.0 3.0
30843 98830844 Female Loyal Customer 23.0 Personal Travel Eco 1942 0.0 0.0 1 NaN Green Car NaN 5.0 3.0 1.0 5.0 1.0 1.0 5.0 2.0 2.0 2.0 3.0 1.0
31493 98831494 Male Disloyal Customer 39.0 NaN Business 1700 0.0 0.0 1 NaN Ordinary NaN 5.0 2.0 1.0 5.0 1.0 1.0 2.0 4.0 1.0 3.0 3.0 1.0
31687 98831688 Female Disloyal Customer 43.0 Business Travel Business 2269 0.0 0.0 1 NaN Green Car NaN 4.0 3.0 5.0 4.0 5.0 5.0 4.0 3.0 4.0 3.0 5.0 5.0
33114 98833115 Male Disloyal Customer 30.0 Business Travel Eco 2180 0.0 0.0 0 NaN Ordinary NaN 2.0 4.0 1.0 2.0 1.0 1.0 NaN NaN NaN 5.0 5.0 1.0
35218 98835219 Female Loyal Customer 38.0 Personal Travel Eco 2341 0.0 0.0 1 NaN Ordinary NaN 4.0 1.0 3.0 4.0 5.0 3.0 4.0 5.0 5.0 3.0 5.0 3.0
35691 98835692 Male Loyal Customer 38.0 Business Travel Business 3480 25.0 39.0 0 NaN Green Car NaN 2.0 2.0 5.0 3.0 3.0 3.0 3.0 3.0 2.0 2.0 3.0 3.0
36254 98836255 Male Disloyal Customer 32.0 Business Travel Eco 1706 0.0 2.0 0 NaN Ordinary NaN 3.0 2.0 1.0 3.0 1.0 1.0 NaN NaN NaN 5.0 1.0 1.0
48013 98848014 Female Loyal Customer 38.0 NaN Business 2371 0.0 0.0 0 NaN Ordinary NaN 4.0 4.0 3.0 4.0 4.0 3.0 NaN NaN NaN 4.0 3.0 2.0
48366 98848367 Female NaN 19.0 Business Travel Business 2721 0.0 0.0 1 NaN Ordinary NaN 3.0 3.0 5.0 5.0 5.0 5.0 5.0 3.0 4.0 5.0 4.0 5.0
50444 98850445 Female Loyal Customer 33.0 Business Travel Business 840 10.0 2.0 1 NaN Ordinary NaN 3.0 3.0 3.0 1.0 2.0 4.0 4.0 4.0 4.0 1.0 4.0 5.0
50968 98850969 Male Loyal Customer 47.0 Personal Travel Eco 1764 45.0 46.0 0 NaN Ordinary NaN 2.0 3.0 1.0 2.0 1.0 1.0 3.0 1.0 4.0 4.0 4.0 1.0
53260 98853261 Male Loyal Customer 59.0 Business Travel Eco 1361 0.0 0.0 0 NaN Ordinary NaN 4.0 4.0 2.0 2.0 2.0 2.0 4.0 1.0 4.0 2.0 3.0 2.0
53810 98853811 Female Loyal Customer 49.0 Business Travel Business 467 0.0 0.0 0 NaN Ordinary NaN 2.0 2.0 2.0 4.0 4.0 2.0 2.0 2.0 2.0 1.0 2.0 2.0
57914 98857915 Female Loyal Customer 26.0 Business Travel Eco 1433 0.0 0.0 1 NaN Green Car NaN 0.0 3.0 2.0 0.0 2.0 2.0 1.0 4.0 4.0 2.0 3.0 2.0
60361 98860362 Male Loyal Customer 55.0 Business Travel Business 1002 0.0 6.0 1 NaN Ordinary NaN 3.0 3.0 2.0 5.0 5.0 4.0 4.0 4.0 4.0 3.0 4.0 4.0
62074 98862075 Female Loyal Customer 48.0 Personal Travel Eco 1157 16.0 9.0 1 NaN Ordinary NaN 1.0 1.0 3.0 4.0 4.0 5.0 5.0 5.0 5.0 3.0 5.0 4.0
63042 98863043 Female Loyal Customer 53.0 Business Travel Eco 1087 45.0 49.0 0 NaN Ordinary NaN 4.0 4.0 4.0 3.0 4.0 1.0 NaN 1.0 1.0 3.0 1.0 3.0
67823 98867824 Female Loyal Customer 49.0 Business Travel Business 1917 8.0 6.0 1 NaN Green Car NaN 4.0 4.0 3.0 4.0 4.0 5.0 5.0 5.0 5.0 5.0 5.0 3.0
69351 98869352 Male Loyal Customer 55.0 Business Travel Business 885 12.0 16.0 1 NaN Ordinary NaN 2.0 2.0 2.0 5.0 5.0 4.0 4.0 4.0 4.0 4.0 4.0 3.0
70455 98870456 Female Loyal Customer 39.0 Business Travel Business 325 40.0 39.0 0 NaN Green Car NaN 5.0 5.0 2.0 3.0 3.0 2.0 2.0 2.0 2.0 2.0 2.0 1.0
70589 98870590 Female Loyal Customer 27.0 Personal Travel Eco 948 8.0 10.0 1 NaN Green Car NaN 4.0 4.0 5.0 5.0 4.0 5.0 NaN 2.0 5.0 2.0 5.0 5.0
71072 98871073 Female Loyal Customer 13.0 Personal Travel Eco 2412 0.0 0.0 1 NaN Ordinary NaN 2.0 2.0 3.0 5.0 5.0 4.0 4.0 4.0 4.0 5.0 4.0 3.0
71383 98871384 Female Loyal Customer 42.0 Business Travel Business 2915 0.0 0.0 1 NaN Ordinary NaN 0.0 2.0 5.0 4.0 5.0 3.0 3.0 3.0 3.0 5.0 3.0 4.0
73251 98873252 Male Loyal Customer 47.0 Business Travel Business 2011 0.0 0.0 1 NaN Ordinary NaN 4.0 4.0 4.0 5.0 5.0 4.0 4.0 4.0 4.0 5.0 4.0 5.0
73867 98873868 Female Disloyal Customer 40.0 Business Travel Business 1810 79.0 73.0 0 NaN Green Car NaN 1.0 3.0 2.0 1.0 2.0 2.0 1.0 2.0 4.0 3.0 4.0 2.0
74588 98874589 Male Disloyal Customer 24.0 Business Travel Business 1936 0.0 0.0 1 NaN Ordinary NaN 4.0 3.0 4.0 4.0 4.0 4.0 5.0 5.0 4.0 4.0 5.0 4.0
77480 98877481 Female Loyal Customer 25.0 NaN Eco 3508 6.0 10.0 1 NaN Green Car NaN 2.0 2.0 2.0 5.0 4.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0
81399 98881400 Female Loyal Customer 9.0 Business Travel Eco 2278 60.0 53.0 0 NaN Ordinary NaN 4.0 1.0 3.0 3.0 4.0 3.0 NaN NaN NaN 1.0 4.0 3.0
81504 98881505 Female Loyal Customer 47.0 Personal Travel Eco 1076 34.0 65.0 1 NaN Ordinary NaN 0.0 4.0 5.0 5.0 4.0 2.0 NaN NaN NaN 5.0 2.0 3.0
81660 98881661 Male Disloyal Customer 55.0 Business Travel Eco 2033 0.0 10.0 1 NaN Green Car NaN 5.0 2.0 5.0 5.0 5.0 5.0 5.0 1.0 5.0 1.0 4.0 5.0
81876 98881877 Female Loyal Customer 33.0 Business Travel Eco 1713 19.0 23.0 1 NaN Green Car NaN 5.0 5.0 5.0 5.0 5.0 5.0 NaN NaN NaN 4.0 1.0 5.0
84239 98884240 Male Loyal Customer 60.0 NaN Business 1700 0.0 0.0 1 NaN Ordinary NaN 1.0 1.0 2.0 5.0 4.0 4.0 4.0 4.0 4.0 3.0 4.0 4.0
84551 98884552 Male Loyal Customer 60.0 Business Travel Eco 2085 3.0 0.0 0 NaN Ordinary NaN 1.0 1.0 2.0 2.0 2.0 2.0 NaN NaN NaN 3.0 3.0 2.0
85713 98885714 Female Disloyal Customer 26.0 Business Travel Eco 2061 4.0 34.0 0 NaN Green Car NaN 2.0 4.0 2.0 2.0 2.0 2.0 NaN NaN NaN 3.0 4.0 2.0
86883 98886884 Male Loyal Customer 13.0 Personal Travel Eco 2188 4.0 0.0 0 NaN Green Car NaN 3.0 3.0 2.0 3.0 2.0 2.0 4.0 3.0 2.0 4.0 4.0 2.0
87693 98887694 Female Loyal Customer 40.0 Business Travel Eco 2361 0.0 0.0 0 NaN Green Car NaN 5.0 5.0 2.0 2.0 2.0 2.0 NaN NaN NaN 3.0 3.0 2.0
88765 98888766 Female Disloyal Customer 49.0 Business Travel Eco 1788 22.0 18.0 0 NaN Green Car NaN 3.0 4.0 2.0 3.0 1.0 2.0 NaN NaN NaN 2.0 3.0 2.0
90800 98890801 Male Loyal Customer 37.0 Business Travel Eco 1528 76.0 56.0 0 NaN Ordinary NaN 5.0 2.0 3.0 3.0 3.0 3.0 NaN 5.0 3.0 3.0 4.0 3.0
93674 98893675 Male Loyal Customer 48.0 Business Travel Eco 1315 0.0 0.0 1 NaN Green Car NaN 3.0 3.0 4.0 4.0 4.0 4.0 2.0 3.0 4.0 1.0 2.0 4.0

Observations:

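  1. The distribution is fairly close to normal.
  2. There are 61 null values, so about 0.06 percent of the data is missing.
  3. Arrival_Time_Convenient is also missing in every row shown above; otherwise there is no observable pattern in the rows with missing values, so we can substitute the missing values with the mode, which is ‘Acceptable’.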
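
The missing share and the mode substitution can be sketched as follows. The series is hypothetical toy data, and the code assumes the ratings are numerically coded with 3 standing for ‘Acceptable’.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings standing in for train['Seat_Comfort']
seat_comfort = pd.Series(
    [3.0, 3.0, 4.0, np.nan, 2.0, 3.0, np.nan, 5.0], name="Seat_Comfort"
)

# Percentage of missing entries
pct_missing = seat_comfort.isna().mean() * 100

# mode() ignores NaN and returns a Series; take the first (most frequent) value
mode_value = seat_comfort.mode().iloc[0]

# Fill the gaps with the mode
seat_comfort_filled = seat_comfort.fillna(mode_value)

print(pct_missing, mode_value)
```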

Onboard Wifi Service¶

In [ ]:
labeled_countplot(train,'Onboard_Wifi_Service', perc = True, order = True)
Number of null values:  30

Observations:

  1. The data is skewed, with more people satisfied with the onboard wifi service than not.
  2. We confirmed earlier that when the onboard wifi service value is null, the platform location value is null too.
  3. One possible explanation is that the Shinkansen station is in a less developed region with poor connectivity, which would also make the station harder to locate and access.
  4. Substitute the null values with the lowest value, which is “extremely poor”.
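
A constant fill of this kind might look like the sketch below. The series is hypothetical toy data, and the code assumes “extremely poor” is coded as 0.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings standing in for train['Onboard_Wifi_Service']
wifi = pd.Series([5.0, 4.0, np.nan, 2.0, np.nan, 4.0], name="Onboard_Wifi_Service")

# Assumption: 0 encodes the lowest rating, "extremely poor"
LOWEST_RATING = 0.0

n_missing = int(wifi.isna().sum())
wifi_filled = wifi.fillna(LOWEST_RATING)

print(n_missing, wifi_filled.tolist())
```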

Onboard Entertainment¶

In [ ]:
labeled_countplot(train,'Onboard_Entertainment', perc = True, order = True)
Number of null values:  18
In [ ]:
train.loc[train['Onboard_Entertainment'].isnull()]
Out[ ]:
ID Gender Customer_Type Age Type_Travel Travel_Class Travel_Distance Departure_Delay_in_Mins Arrival_Delay_in_Mins Overall_Experience Seat_Comfort Seat_Class Arrival_Time_Convenient Catering Platform_Location Onboard_Wifi_Service Onboard_Entertainment Online_Support Ease_of_Online_Booking Onboard_Service Legroom Baggage_Handling CheckIn_Service Cleanliness Online_Boarding
2215 98802216 Male Loyal Customer 16.0 Personal Travel Eco 2977 0.0 0.0 0 2.0 Green Car NaN NaN NaN NaN NaN NaN 4.0 5.0 1.0 5.0 1.0 3.0 4.0
4477 98804478 Female Loyal Customer 69.0 Personal Travel Eco 881 4.0 18.0 0 3.0 Green Car NaN NaN NaN NaN NaN NaN 2.0 2.0 3.0 3.0 3.0 2.0 5.0
8775 98808776 Female Loyal Customer 25.0 Business Travel Business 2815 0.0 0.0 1 5.0 Ordinary NaN NaN NaN NaN NaN NaN 4.0 NaN 2.0 5.0 5.0 4.0 4.0
17607 98817608 Male Disloyal Customer 37.0 Business Travel Business 2007 4.0 16.0 0 1.0 Ordinary NaN NaN NaN NaN NaN NaN 5.0 5.0 3.0 5.0 5.0 4.0 5.0
22867 98822868 Female Loyal Customer 47.0 Personal Travel Eco 406 0.0 0.0 1 3.0 Ordinary NaN NaN NaN NaN NaN NaN 4.0 3.0 3.0 4.0 4.0 5.0 4.0
35373 98835374 Male Loyal Customer 47.0 Business Travel Business 738 13.0 1.0 1 3.0 Green Car NaN NaN NaN NaN NaN NaN 5.0 5.0 5.0 5.0 5.0 5.0 3.0
41688 98841689 Male Loyal Customer 52.0 Business Travel Business 3535 0.0 0.0 0 3.0 Green Car NaN NaN NaN NaN NaN NaN 3.0 3.0 3.0 3.0 2.0 3.0 2.0
47006 98847007 Female Disloyal Customer 25.0 Business Travel Eco 1098 0.0 0.0 0 1.0 Ordinary NaN NaN NaN NaN NaN NaN 3.0 5.0 3.0 NaN NaN 2.0 3.0
53010 98853011 Female Loyal Customer 67.0 Personal Travel Eco 1140 50.0 46.0 1 4.0 Ordinary NaN NaN NaN NaN NaN NaN 2.0 2.0 4.0 2.0 4.0 2.0 4.0
53500 98853501 Male Loyal Customer 47.0 Business Travel Business 1990 0.0 3.0 1 5.0 Green Car NaN NaN NaN NaN NaN NaN 3.0 3.0 3.0 3.0 2.0 3.0 5.0
59270 98859271 Male Loyal Customer 22.0 Personal Travel Eco 2220 39.0 41.0 0 3.0 Ordinary NaN NaN NaN NaN NaN NaN 1.0 5.0 2.0 5.0 5.0 5.0 1.0
62201 98862202 Female NaN 26.0 Business Travel Eco 2074 2.0 0.0 1 0.0 Ordinary NaN NaN NaN NaN NaN NaN 3.0 5.0 2.0 4.0 4.0 3.0 3.0
69915 98869916 Male Loyal Customer 46.0 Business Travel Business 366 8.0 0.0 1 4.0 Ordinary NaN NaN NaN NaN NaN NaN 5.0 5.0 5.0 5.0 3.0 5.0 5.0
70266 98870267 Female Loyal Customer 66.0 Business Travel Business 3207 0.0 0.0 1 0.0 Green Car NaN NaN NaN NaN NaN NaN 3.0 3.0 3.0 3.0 5.0 3.0 5.0
76005 98876006 Male Loyal Customer 68.0 Personal Travel Eco 2246 3.0 0.0 0 2.0 Green Car NaN NaN NaN NaN NaN NaN 2.0 4.0 4.0 2.0 1.0 1.0 2.0
85184 98885185 Female Loyal Customer 48.0 Personal Travel Business 2926 0.0 0.0 1 5.0 Green Car NaN NaN NaN NaN NaN NaN 4.0 4.0 4.0 4.0 2.0 4.0 1.0
89394 98889395 Female Loyal Customer 49.0 Personal Travel Eco 210 0.0 0.0 1 4.0 Ordinary NaN NaN NaN NaN NaN NaN 1.0 4.0 3.0 2.0 4.0 3.0 1.0
91055 98891056 Male Loyal Customer 42.0 Business Travel Eco 2272 0.0 0.0 0 2.0 Green Car NaN NaN NaN NaN NaN NaN 2.0 4.0 2.0 3.0 1.0 3.0 2.0
In [ ]:
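# Study the distribution of entertainment ratings when the wifi service is rated extremely poor (0)
badwifi = train.loc[train['Onboard_Wifi_Service'] == 0]
# Pass the frame via data= so the call does not depend on seaborn's positional argument order
sns.countplot(data = badwifi, x = 'Onboard_Entertainment');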

Observations:

  1. The median rating is “Good”.
  2. More people are satisfied with the onboard entertainment than not.
  3. There are 18 null values.
  4. Whenever onboard entertainment is null, the onboard wifi service is null too, which suggests that wifi availability affects the onboard entertainment.
  5. When the wifi service is rated extremely poor, the entertainment ratings shift towards “acceptable” and “poor”.
  6. The null values can be substituted with “Acceptable”, as it is the more common of these two ratings.
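
The reasoning in observations 4 to 6 can be sketched without the plot, by counting entertainment ratings among the extremely-poor-wifi rows and filling with the most common one. `toy` is a hypothetical stand-in for `train`, and the coding 0 for extremely poor and 3 for acceptable is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `train`
toy = pd.DataFrame({
    "Onboard_Wifi_Service":  [0.0, 0.0, 0.0, 0.0, 4.0, 5.0, np.nan, np.nan],
    "Onboard_Entertainment": [3.0, 3.0, 2.0, 3.0, 5.0, 4.0, np.nan, np.nan],
})

# Entertainment is only missing where wifi is missing as well
ent_missing = toy["Onboard_Entertainment"].isna()
wifi_missing_too = toy.loc[ent_missing, "Onboard_Wifi_Service"].isna().all()

# Entertainment ratings among rows with extremely poor wifi (coded 0)
badwifi = toy.loc[toy["Onboard_Wifi_Service"] == 0]
rating_counts = badwifi["Onboard_Entertainment"].value_counts().sort_index()

# Most common rating in that subset, used as the fill value
most_common = rating_counts.idxmax()
entertainment_filled = toy["Onboard_Entertainment"].fillna(most_common)

print(rating_counts)
print(most_common)
```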

Online Support¶

In [ ]:
labeled_countplot(train,'Online_Support', perc = True, order = True)
Number of null values:  91
In [ ]:
missing_os = train.loc[train['Online_Support'].isnull()]
missing_os
Out[ ]:
ID Gender Customer_Type Age Type_Travel Travel_Class Travel_Distance Departure_Delay_in_Mins Arrival_Delay_in_Mins Overall_Experience Seat_Comfort Seat_Class Arrival_Time_Convenient Catering Platform_Location Onboard_Wifi_Service Onboard_Entertainment Online_Support Ease_of_Online_Booking Onboard_Service Legroom Baggage_Handling CheckIn_Service Cleanliness Online_Boarding
13 98800014 Female Loyal Customer 47.0 Personal Travel Eco 1100 20.0 34.0 0 4.0 Ordinary 4.0 4.0 3.0 4.0 5.0 NaN NaN 3.0 4.0 NaN NaN 3.0 4.0
1708 98801709 Male Loyal Customer 15.0 Personal Travel Eco 3443 27.0 37.0 0 3.0 Green Car 5.0 3.0 2.0 3.0 3.0 NaN NaN NaN NaN 5.0 3.0 3.0 3.0
2215 98802216 Male Loyal Customer 16.0 Personal Travel Eco 2977 0.0 0.0 0 2.0 Green Car NaN NaN NaN NaN NaN NaN 4.0 5.0 1.0 5.0 1.0 3.0 4.0
2821 98802822 Male Loyal Customer 30.0 NaN Eco 1018 21.0 11.0 1 5.0 Green Car 1.0 3.0 3.0 5.0 5.0 NaN NaN 2.0 3.0 NaN NaN 5.0 5.0
3171 98803172 Female Loyal Customer 50.0 Business Travel Business 3843 0.0 0.0 1 5.0 Green Car 5.0 1.0 5.0 2.0 4.0 NaN NaN NaN NaN 4.0 5.0 4.0 5.0
3249 98803250 Male Loyal Customer 43.0 Business Travel Business 807 27.0 13.0 0 3.0 Ordinary NaN 1.0 1.0 3.0 3.0 NaN NaN 3.0 3.0 NaN NaN 3.0 4.0
4477 98804478 Female Loyal Customer 69.0 Personal Travel Eco 881 4.0 18.0 0 3.0 Green Car NaN NaN NaN NaN NaN NaN 2.0 2.0 3.0 3.0 3.0 2.0 5.0
5912 98805913 Male NaN 60.0 Business Travel Business 2931 0.0 7.0 0 4.0 Ordinary NaN 3.0 3.0 2.0 3.0 NaN NaN 4.0 4.0 NaN NaN 4.0 3.0
8775 98808776 Female Loyal Customer 25.0 Business Travel Business 2815 0.0 0.0 1 5.0 Ordinary NaN NaN NaN NaN NaN NaN 4.0 NaN 2.0 5.0 5.0 4.0 4.0
8904 98808905 Male Loyal Customer 57.0 Business Travel Business 601 4.0 0.0 1 5.0 Ordinary 5.0 2.0 5.0 4.0 5.0 NaN NaN NaN NaN 5.0 4.0 5.0 5.0
10656 98810657 Male Loyal Customer 10.0 Personal Travel Eco 2163 0.0 0.0 0 1.0 Ordinary 5.0 NaN 2.0 3.0 1.0 NaN NaN 5.0 3.0 NaN NaN 5.0 3.0
12755 98812756 Female Loyal Customer 47.0 NaN Eco 3113 1592.0 1584.0 0 2.0 Ordinary 2.0 2.0 3.0 2.0 4.0 NaN NaN 4.0 4.0 NaN NaN 3.0 2.0
15119 98815120 Female Loyal Customer 53.0 NaN Business 1646 0.0 0.0 1 5.0 Green Car 5.0 NaN 5.0 3.0 3.0 NaN NaN 4.0 4.0 NaN NaN 4.0 5.0
16347 98816348 Male Disloyal Customer 43.0 Business Travel Business 1491 53.0 53.0 0 3.0 Green Car 2.0 2.0 3.0 4.0 3.0 NaN NaN 1.0 5.0 NaN NaN 4.0 4.0
17364 98817365 Female Loyal Customer 63.0 Personal Travel Business 442 0.0 0.0 1 0.0 Green Car 0.0 NaN 3.0 3.0 4.0 NaN NaN 5.0 5.0 NaN NaN 5.0 4.0
17607 98817608 Male Disloyal Customer 37.0 Business Travel Business 2007 4.0 16.0 0 1.0 Ordinary NaN NaN NaN NaN NaN NaN 5.0 5.0 3.0 5.0 5.0 4.0 5.0
22723 98822724 Male Loyal Customer 44.0 Business Travel Business 2101 0.0 6.0 1 3.0 Green Car 3.0 3.0 3.0 5.0 4.0 NaN NaN NaN NaN 2.0 5.0 2.0 5.0
22867 98822868 Female Loyal Customer 47.0 Personal Travel Eco 406 0.0 0.0 1 3.0 Ordinary NaN NaN NaN NaN NaN NaN 4.0 3.0 3.0 4.0 4.0 5.0 4.0
24017 98824018 Male Loyal Customer 28.0 Business Travel Business 4115 29.0 0.0 1 5.0 Ordinary NaN 5.0 5.0 2.0 2.0 NaN NaN 5.0 3.0 NaN NaN 3.0 2.0
24928 98824929 Male Loyal Customer 55.0 NaN Business 2404 0.0 0.0 0 4.0 Green Car 1.0 1.0 1.0 2.0 4.0 NaN NaN NaN 4.0 NaN NaN 4.0 4.0
26664 98826665 Male Loyal Customer 43.0 Business Travel Business 1022 0.0 29.0 1 1.0 Ordinary 3.0 1.0 1.0 4.0 4.0 NaN NaN NaN NaN 4.0 4.0 4.0 4.0
27784 98827785 Female Disloyal Customer 25.0 Business Travel Business 976 4.0 0.0 0 4.0 Green Car 0.0 4.0 2.0 3.0 4.0 NaN NaN NaN NaN 5.0 3.0 4.0 3.0
28146 98828147 Female Loyal Customer 37.0 Personal Travel Eco 2395 0.0 0.0 1 2.0 Ordinary 2.0 2.0 2.0 2.0 4.0 NaN NaN 4.0 4.0 NaN NaN 4.0 5.0
30082 98830083 Female Disloyal Customer 24.0 Business Travel Eco 1975 3.0 0.0 1 5.0 Ordinary 0.0 NaN 3.0 3.0 0.0 NaN NaN NaN NaN 5.0 4.0 5.0 3.0
30426 98830427 Female Loyal Customer 52.0 Business Travel Business 2423 0.0 0.0 1 5.0 Ordinary 5.0 5.0 5.0 2.0 5.0 NaN NaN NaN NaN 5.0 5.0 5.0 4.0
32913 98832914 Female Loyal Customer 53.0 Business Travel Business 1566 16.0 0.0 1 1.0 Ordinary 1.0 1.0 1.0 4.0 5.0 NaN NaN 4.0 4.0 NaN NaN 4.0 5.0
33809 98833810 Male Loyal Customer 47.0 NaN Eco 1577 0.0 4.0 0 2.0 Green Car 5.0 2.0 2.0 5.0 2.0 NaN NaN 4.0 4.0 NaN NaN 4.0 5.0
34365 98834366 Female Loyal Customer 27.0 Personal Travel Eco 2121 0.0 0.0 1 4.0 Ordinary 4.0 4.0 4.0 5.0 4.0 NaN NaN 5.0 5.0 NaN NaN 5.0 4.0
35112 98835113 Female Loyal Customer 37.0 Personal Travel Eco 1712 0.0 20.0 1 5.0 Green Car 5.0 3.0 5.0 2.0 5.0 NaN NaN 4.0 4.0 NaN NaN 4.0 4.0
35214 98835215 Male Loyal Customer 44.0 Business Travel Business 2476 6.0 0.0 1 4.0 Ordinary 4.0 4.0 4.0 4.0 5.0 NaN NaN 3.0 3.0 NaN NaN 3.0 5.0
35373 98835374 Male Loyal Customer 47.0 Business Travel Business 738 13.0 1.0 1 3.0 Green Car NaN NaN NaN NaN NaN NaN 5.0 5.0 5.0 5.0 5.0 5.0 3.0
35928 98835929 Female Disloyal Customer 21.0 Business Travel Business 1543 0.0 0.0 1 4.0 Green Car 5.0 4.0 3.0 3.0 4.0 NaN NaN 1.0 5.0 NaN NaN 1.0 3.0
37552 98837553 Female Loyal Customer 51.0 Business Travel Business 1555 0.0 0.0 1 3.0 Green Car 3.0 NaN 3.0 2.0 5.0 NaN NaN 4.0 4.0 NaN NaN 4.0 4.0
38172 98838173 Female Loyal Customer 53.0 Business Travel Business 449 0.0 3.0 1 2.0 Green Car 5.0 NaN 2.0 3.0 1.0 NaN NaN 4.0 4.0 NaN NaN 4.0 5.0
38728 98838729 Female NaN 36.0 Business Travel Business 332 60.0 58.0 1 2.0 Ordinary 2.0 2.0 2.0 4.0 3.0 NaN NaN NaN NaN 4.0 1.0 4.0 1.0
39711 98839712 Male Loyal Customer 43.0 Business Travel Business 3936 0.0 0.0 1 4.0 Ordinary 4.0 4.0 4.0 2.0 4.0 NaN NaN NaN NaN 5.0 3.0 5.0 5.0
40893 98840894 Female Disloyal Customer 27.0 Business Travel Eco 1280 19.0 8.0 0 2.0 Ordinary 3.0 2.0 3.0 3.0 2.0 NaN NaN 3.0 2.0 NaN NaN 3.0 3.0
41688 98841689 Male Loyal Customer 52.0 Business Travel Business 3535 0.0 0.0 0 3.0 Green Car NaN NaN NaN NaN NaN NaN 3.0 3.0 3.0 3.0 2.0 3.0 2.0
43644 98843645 Female Loyal Customer 25.0 Personal Travel Eco 2109 1.0 2.0 1 1.0 Green Car 1.0 4.0 1.0 5.0 5.0 NaN NaN NaN NaN 1.0 2.0 4.0 5.0
43836 98843837 Female Loyal Customer 13.0 Personal Travel Eco 1880 0.0 0.0 1 4.0 Ordinary 4.0 4.0 4.0 2.0 5.0 NaN NaN 4.0 4.0 NaN NaN 4.0 4.0
44050 98844051 Male Loyal Customer 19.0 Personal Travel Eco 1410 0.0 0.0 1 4.0 Ordinary 2.0 1.0 4.0 5.0 1.0 NaN NaN NaN NaN 4.0 4.0 3.0 5.0
45055 98845056 Female Loyal Customer 30.0 NaN Business 4707 1.0 20.0 0 2.0 Ordinary 4.0 4.0 4.0 2.0 2.0 NaN NaN 3.0 1.0 NaN NaN 3.0 2.0
46703 98846704 Female Disloyal Customer 24.0 NaN Eco 2008 130.0 127.0 0 3.0 Green Car 2.0 3.0 3.0 3.0 3.0 NaN NaN 3.0 4.0 NaN NaN 3.0 3.0
47006 98847007 Female Disloyal Customer 25.0 Business Travel Eco 1098 0.0 0.0 0 1.0 Ordinary NaN NaN NaN NaN NaN NaN 3.0 5.0 3.0 NaN NaN 2.0 3.0
47952 98847953 Female Disloyal Customer 50.0 Business Travel Eco 2062 0.0 0.0 0 3.0 Ordinary 0.0 3.0 4.0 4.0 3.0 NaN NaN NaN NaN 3.0 4.0 4.0 4.0
50305 98850306 Male Loyal Customer 23.0 Personal Travel Eco 2534 0.0 0.0 0 2.0 Green Car NaN 2.0 3.0 4.0 2.0 NaN NaN 3.0 4.0 NaN NaN 3.0 4.0
50701 98850702 Male Loyal Customer 63.0 Personal Travel Eco 2443 1.0 0.0 0 3.0 Green Car 5.0 4.0 5.0 3.0 4.0 NaN NaN 5.0 4.0 NaN NaN 4.0 3.0
51384 98851385 Female Loyal Customer 51.0 Business Travel Business 4037 8.0 0.0 1 1.0 Green Car 1.0 1.0 1.0 4.0 5.0 NaN NaN NaN NaN 4.0 3.0 4.0 4.0
51963 98851964 Male Loyal Customer 55.0 Business Travel Eco 2255 0.0 0.0 1 4.0 Ordinary NaN 5.0 5.0 4.0 4.0 NaN NaN 5.0 1.0 NaN NaN 4.0 4.0
53010 98853011 Female Loyal Customer 67.0 Personal Travel Eco 1140 50.0 46.0 1 4.0 Ordinary NaN NaN NaN NaN NaN NaN 2.0 2.0 4.0 2.0 4.0 2.0 4.0
53370 98853371 Male Loyal Customer 40.0 Personal Travel Eco 2251 0.0 8.0 0 1.0 Green Car 1.0 1.0 1.0 1.0 1.0 NaN NaN 3.0 3.0 NaN NaN 1.0 1.0
53500 98853501 Male Loyal Customer 47.0 Business Travel Business 1990 0.0 3.0 1 5.0 Green Car NaN NaN NaN NaN NaN NaN 3.0 3.0 3.0 3.0 2.0 3.0 5.0
57283 98857284 Female Loyal Customer 41.0 Business Travel Business 1952 46.0 44.0 1 1.0 Green Car 1.0 1.0 1.0 5.0 4.0 NaN NaN NaN NaN 5.0 5.0 5.0 4.0
58878 98858879 Female Loyal Customer 52.0 Personal Travel Eco 1789 2.0 0.0 1 4.0 Ordinary 5.0 NaN 4.0 5.0 5.0 NaN NaN 2.0 5.0 NaN NaN 2.0 3.0
58982 98858983 Male Disloyal Customer 7.0 Business Travel Eco 2016 22.0 11.0 0 4.0 Green Car 2.0 4.0 4.0 5.0 4.0 NaN NaN 4.0 2.0 NaN NaN 4.0 5.0
59270 98859271 Male Loyal Customer 22.0 Personal Travel Eco 2220 39.0 41.0 0 3.0 Ordinary NaN NaN NaN NaN NaN NaN 1.0 5.0 2.0 5.0 5.0 5.0 1.0
59864 98859865 Male Loyal Customer 17.0 Business Travel Business 1242 0.0 0.0 1 3.0 Green Car 3.0 3.0 3.0 5.0 5.0 NaN NaN NaN NaN 5.0 2.0 2.0 5.0
59980 98859981 Male NaN 41.0 Business Travel Business 3804 0.0 0.0 1 4.0 Ordinary 4.0 4.0 4.0 1.0 1.0 NaN NaN NaN NaN 4.0 3.0 4.0 3.0
61021 98861022 Male Loyal Customer 39.0 Business Travel Business 1932 14.0 33.0 1 3.0 Green Car 3.0 NaN 3.0 5.0 5.0 NaN NaN 4.0 4.0 NaN NaN 4.0 5.0
61222 98861223 Female Loyal Customer 41.0 Business Travel Eco 622 6.0 2.0 0 2.0 Ordinary 1.0 2.0 1.0 4.0 2.0 NaN NaN NaN NaN 2.0 2.0 2.0 2.0
62201 98862202 Female NaN 26.0 Business Travel Eco 2074 2.0 0.0 1 0.0 Ordinary NaN NaN NaN NaN NaN NaN 3.0 5.0 2.0 4.0 4.0 3.0 3.0
66977 98866978 Female Loyal Customer 53.0 Business Travel Business 922 6.0 5.0 1 3.0 Green Car 3.0 3.0 3.0 1.0 4.0 NaN NaN 4.0 4.0 NaN NaN 4.0 2.0
68315 98868316 Female Disloyal Customer 22.0 Business Travel Eco 2237 9.0 0.0 1 2.0 Ordinary 3.0 3.0 3.0 4.0 3.0 NaN NaN 3.0 3.0 NaN NaN 3.0 4.0
68326 98868327 Male Loyal Customer 60.0 Business Travel Eco 2304 42.0 39.0 0 1.0 Ordinary 3.0 3.0 3.0 1.0 1.0 NaN NaN 1.0 2.0 NaN NaN 4.0 1.0
68796 98868797 Female Loyal Customer 54.0 Personal Travel Eco 245 0.0 6.0 1 5.0 Green Car 5.0 5.0 5.0 5.0 5.0 NaN NaN 4.0 4.0 NaN NaN 4.0 5.0
69915 98869916 Male Loyal Customer 46.0 Business Travel Business 366 8.0 0.0 1 4.0 Ordinary NaN NaN NaN NaN NaN NaN 5.0 5.0 5.0 5.0 3.0 5.0 5.0
70266 98870267 Female Loyal Customer 66.0 Business Travel Business 3207 0.0 0.0 1 0.0 Green Car NaN NaN NaN NaN NaN NaN 3.0 3.0 3.0 3.0 5.0 3.0 5.0
70627 98870628 Female Loyal Customer 49.0 Business Travel Business 2522 0.0 0.0 1 5.0 Ordinary 5.0 NaN 5.0 5.0 4.0 NaN NaN 5.0 5.0 NaN NaN 5.0 3.0
72549 98872550 Female Loyal Customer 39.0 Business Travel Business 629 0.0 0.0 1 5.0 Green Car 5.0 5.0 5.0 5.0 5.0 NaN NaN 4.0 4.0 NaN NaN 4.0 5.0
73825 98873826 Male Loyal Customer 8.0 Business Travel Business 2988 2.0 0.0 1 3.0 Green Car 3.0 3.0 3.0 2.0 2.0 NaN NaN 5.0 5.0 NaN NaN 4.0 2.0
76005 98876006 Male Loyal Customer 68.0 Personal Travel Eco 2246 3.0 0.0 0 2.0 Green Car NaN NaN NaN NaN NaN NaN 2.0 4.0 4.0 2.0 1.0 1.0 2.0
80131 98880132 Male Loyal Customer 52.0 Business Travel Business 3161 0.0 2.0 1 5.0 Green Car 5.0 5.0 5.0 3.0 4.0 NaN NaN 4.0 4.0 NaN NaN 4.0 5.0
81304 98881305 Male Loyal Customer 50.0 Business Travel Eco 1707 0.0 0.0 1 4.0 Green Car 2.0 2.0 2.0 4.0 4.0 NaN NaN 2.0 5.0 NaN NaN 1.0 4.0
81664 98881665 Male Loyal Customer 60.0 Business Travel Business 2212 0.0 0.0 1 1.0 Green Car 1.0 5.0 1.0 5.0 4.0 NaN NaN 4.0 4.0 NaN NaN 4.0 5.0
81800 98881801 Male Loyal Customer 58.0 Business Travel Business 2009 0.0 0.0 1 1.0 Ordinary 1.0 1.0 1.0 3.0 5.0 NaN NaN 4.0 4.0 NaN NaN 4.0 5.0
81996 98881997 Male Loyal Customer 34.0 Business Travel Business 3210 80.0 73.0 1 1.0 Ordinary 1.0 4.0 1.0 1.0 2.0 NaN NaN 5.0 5.0 NaN NaN 5.0 2.0
82744 98882745 Female NaN 18.0 Business Travel Eco 1625 0.0 0.0 1 5.0 Green Car 3.0 5.0 2.0 4.0 5.0 NaN NaN NaN NaN 4.0 4.0 5.0 4.0
83052 98883053 Female Loyal Customer 42.0 Personal Travel Eco 890 0.0 0.0 1 3.0 Green Car 3.0 3.0 3.0 4.0 4.0 NaN NaN 5.0 3.0 NaN NaN 4.0 4.0
83881 98883882 Male Loyal Customer 11.0 Personal Travel Eco 2883 0.0 0.0 0 2.0 Ordinary 4.0 2.0 1.0 1.0 2.0 NaN NaN 5.0 3.0 NaN NaN 4.0 1.0
84434 98884435 Male Loyal Customer 40.0 Business Travel Business 479 50.0 49.0 0 1.0 Green Car 3.0 3.0 3.0 2.0 3.0 NaN NaN NaN NaN 1.0 3.0 1.0 3.0
84798 98884799 Female Loyal Customer 57.0 Business Travel Business 3931 58.0 64.0 1 5.0 Ordinary 5.0 NaN 5.0 5.0 4.0 NaN NaN 4.0 4.0 NaN NaN 4.0 5.0
85184 98885185 Female Loyal Customer 48.0 Personal Travel Business 2926 0.0 0.0 1 5.0 Green Car NaN NaN NaN NaN NaN NaN 4.0 4.0 4.0 4.0 2.0 4.0 1.0
85216 98885217 Female Loyal Customer 25.0 Business Travel Business 2894 0.0 0.0 1 2.0 Green Car 2.0 2.0 2.0 4.0 4.0 NaN NaN NaN NaN 3.0 4.0 4.0 4.0
86180 98886181 Male Loyal Customer 44.0 Business Travel Eco 1574 0.0 0.0 1 5.0 Green Car NaN 2.0 2.0 5.0 5.0 NaN NaN 1.0 5.0 NaN NaN 1.0 5.0
87129 98887130 Male Loyal Customer 34.0 Business Travel Business 3925 5.0 33.0 1 5.0 Ordinary 5.0 5.0 5.0 5.0 3.0 NaN NaN NaN 5.0 NaN NaN 5.0 1.0
87755 98887756 Female Loyal Customer 39.0 Personal Travel Eco 1630 34.0 24.0 1 5.0 Ordinary 1.0 1.0 1.0 5.0 5.0 NaN NaN NaN NaN 1.0 5.0 4.0 5.0
89394 98889395 Female Loyal Customer 49.0 Personal Travel Eco 210 0.0 0.0 1 4.0 Ordinary NaN NaN NaN NaN NaN NaN 1.0 4.0 3.0 2.0 4.0 3.0 1.0
89951 98889952 Female Loyal Customer 37.0 Business Travel Business 2810 3.0 1.0 1 4.0 Ordinary 3.0 4.0 4.0 2.0 3.0 NaN NaN NaN NaN 4.0 3.0 4.0 4.0
90252 98890253 Male Loyal Customer 50.0 Personal Travel Eco 2132 7.0 39.0 0 3.0 Green Car 4.0 3.0 3.0 5.0 3.0 NaN NaN NaN NaN 4.0 3.0 5.0 5.0
91055 98891056 Male Loyal Customer 42.0 Business Travel Eco 2272 0.0 0.0 0 2.0 Green Car NaN NaN NaN NaN NaN NaN 2.0 4.0 2.0 3.0 1.0 3.0 2.0
91805 98891806 Female Loyal Customer 56.0 Personal Travel Business 984 0.0 0.0 1 0.0 Ordinary 3.0 1.0 3.0 1.0 1.0 NaN NaN NaN NaN 5.0 5.0 1.0 2.0

Ease of Online Booking¶

In [ ]:
labeled_countplot(train,'Ease_of_Online_Booking', perc = True, order = True)
Number of null values:  73
In [ ]:
train.loc[train['Ease_of_Online_Booking'].isnull()]
Out[ ]:
ID Gender Customer_Type Age Type_Travel Travel_Class Travel_Distance Departure_Delay_in_Mins Arrival_Delay_in_Mins Overall_Experience Seat_Comfort Seat_Class Arrival_Time_Convenient Catering Platform_Location Onboard_Wifi_Service Onboard_Entertainment Online_Support Ease_of_Online_Booking Onboard_Service Legroom Baggage_Handling CheckIn_Service Cleanliness Online_Boarding
13 98800014 Female Loyal Customer 47.0 Personal Travel Eco 1100 20.0 34.0 0 4.0 Ordinary 4.0 4.0 3.0 4.0 5.0 NaN NaN 3.0 4.0 NaN NaN 3.0 4.0
1708 98801709 Male Loyal Customer 15.0 Personal Travel Eco 3443 27.0 37.0 0 3.0 Green Car 5.0 3.0 2.0 3.0 3.0 NaN NaN NaN NaN 5.0 3.0 3.0 3.0
2821 98802822 Male Loyal Customer 30.0 NaN Eco 1018 21.0 11.0 1 5.0 Green Car 1.0 3.0 3.0 5.0 5.0 NaN NaN 2.0 3.0 NaN NaN 5.0 5.0
3171 98803172 Female Loyal Customer 50.0 Business Travel Business 3843 0.0 0.0 1 5.0 Green Car 5.0 1.0 5.0 2.0 4.0 NaN NaN NaN NaN 4.0 5.0 4.0 5.0
3249 98803250 Male Loyal Customer 43.0 Business Travel Business 807 27.0 13.0 0 3.0 Ordinary NaN 1.0 1.0 3.0 3.0 NaN NaN 3.0 3.0 NaN NaN 3.0 4.0
5912 98805913 Male NaN 60.0 Business Travel Business 2931 0.0 7.0 0 4.0 Ordinary NaN 3.0 3.0 2.0 3.0 NaN NaN 4.0 4.0 NaN NaN 4.0 3.0
8904 98808905 Male Loyal Customer 57.0 Business Travel Business 601 4.0 0.0 1 5.0 Ordinary 5.0 2.0 5.0 4.0 5.0 NaN NaN NaN NaN 5.0 4.0 5.0 5.0
10656 98810657 Male Loyal Customer 10.0 Personal Travel Eco 2163 0.0 0.0 0 1.0 Ordinary 5.0 NaN 2.0 3.0 1.0 NaN NaN 5.0 3.0 NaN NaN 5.0 3.0
12755 98812756 Female Loyal Customer 47.0 NaN Eco 3113 1592.0 1584.0 0 2.0 Ordinary 2.0 2.0 3.0 2.0 4.0 NaN NaN 4.0 4.0 NaN NaN 3.0 2.0
15119 98815120 Female Loyal Customer 53.0 NaN Business 1646 0.0 0.0 1 5.0 Green Car 5.0 NaN 5.0 3.0 3.0 NaN NaN 4.0 4.0 NaN NaN 4.0 5.0
16347 98816348 Male Disloyal Customer 43.0 Business Travel Business 1491 53.0 53.0 0 3.0 Green Car 2.0 2.0 3.0 4.0 3.0 NaN NaN 1.0 5.0 NaN NaN 4.0 4.0
17364 98817365 Female Loyal Customer 63.0 Personal Travel Business 442 0.0 0.0 1 0.0 Green Car 0.0 NaN 3.0 3.0 4.0 NaN NaN 5.0 5.0 NaN NaN 5.0 4.0
22723 98822724 Male Loyal Customer 44.0 Business Travel Business 2101 0.0 6.0 1 3.0 Green Car 3.0 3.0 3.0 5.0 4.0 NaN NaN NaN NaN 2.0 5.0 2.0 5.0
24017 98824018 Male Loyal Customer 28.0 Business Travel Business 4115 29.0 0.0 1 5.0 Ordinary NaN 5.0 5.0 2.0 2.0 NaN NaN 5.0 3.0 NaN NaN 3.0 2.0
24928 98824929 Male Loyal Customer 55.0 NaN Business 2404 0.0 0.0 0 4.0 Green Car 1.0 1.0 1.0 2.0 4.0 NaN NaN NaN 4.0 NaN NaN 4.0 4.0
26664 98826665 Male Loyal Customer 43.0 Business Travel Business 1022 0.0 29.0 1 1.0 Ordinary 3.0 1.0 1.0 4.0 4.0 NaN NaN NaN NaN 4.0 4.0 4.0 4.0
27784 98827785 Female Disloyal Customer 25.0 Business Travel Business 976 4.0 0.0 0 4.0 Green Car 0.0 4.0 2.0 3.0 4.0 NaN NaN NaN NaN 5.0 3.0 4.0 3.0
28146 98828147 Female Loyal Customer 37.0 Personal Travel Eco 2395 0.0 0.0 1 2.0 Ordinary 2.0 2.0 2.0 2.0 4.0 NaN NaN 4.0 4.0 NaN NaN 4.0 5.0
30082 98830083 Female Disloyal Customer 24.0 Business Travel Eco 1975 3.0 0.0 1 5.0 Ordinary 0.0 NaN 3.0 3.0 0.0 NaN NaN NaN NaN 5.0 4.0 5.0 3.0
30426 98830427 Female Loyal Customer 52.0 Business Travel Business 2423 0.0 0.0 1 5.0 Ordinary 5.0 5.0 5.0 2.0 5.0 NaN NaN NaN NaN 5.0 5.0 5.0 4.0
32913 98832914 Female Loyal Customer 53.0 Business Travel Business 1566 16.0 0.0 1 1.0 Ordinary 1.0 1.0 1.0 4.0 5.0 NaN NaN 4.0 4.0 NaN NaN 4.0 5.0
33809 98833810 Male Loyal Customer 47.0 NaN Eco 1577 0.0 4.0 0 2.0 Green Car 5.0 2.0 2.0 5.0 2.0 NaN NaN 4.0 4.0 NaN NaN 4.0 5.0
34365 98834366 Female Loyal Customer 27.0 Personal Travel Eco 2121 0.0 0.0 1 4.0 Ordinary 4.0 4.0 4.0 5.0 4.0 NaN NaN 5.0 5.0 NaN NaN 5.0 4.0
35112 98835113 Female Loyal Customer 37.0 Personal Travel Eco 1712 0.0 20.0 1 5.0 Green Car 5.0 3.0 5.0 2.0 5.0 NaN NaN 4.0 4.0 NaN NaN 4.0 4.0
35214 98835215 Male Loyal Customer 44.0 Business Travel Business 2476 6.0 0.0 1 4.0 Ordinary 4.0 4.0 4.0 4.0 5.0 NaN NaN 3.0 3.0 NaN NaN 3.0 5.0
35928 98835929 Female Disloyal Customer 21.0 Business Travel Business 1543 0.0 0.0 1 4.0 Green Car 5.0 4.0 3.0 3.0 4.0 NaN NaN 1.0 5.0 NaN NaN 1.0 3.0
37552 98837553 Female Loyal Customer 51.0 Business Travel Business 1555 0.0 0.0 1 3.0 Green Car 3.0 NaN 3.0 2.0 5.0 NaN NaN 4.0 4.0 NaN NaN 4.0 4.0
38172 98838173 Female Loyal Customer 53.0 Business Travel Business 449 0.0 3.0 1 2.0 Green Car 5.0 NaN 2.0 3.0 1.0 NaN NaN 4.0 4.0 NaN NaN 4.0 5.0
38728 98838729 Female NaN 36.0 Business Travel Business 332 60.0 58.0 1 2.0 Ordinary 2.0 2.0 2.0 4.0 3.0 NaN NaN NaN NaN 4.0 1.0 4.0 1.0
39711 98839712 Male Loyal Customer 43.0 Business Travel Business 3936 0.0 0.0 1 4.0 Ordinary 4.0 4.0 4.0 2.0 4.0 NaN NaN NaN NaN 5.0 3.0 5.0 5.0
40893 98840894 Female Disloyal Customer 27.0 Business Travel Eco 1280 19.0 8.0 0 2.0 Ordinary 3.0 2.0 3.0 3.0 2.0 NaN NaN 3.0 2.0 NaN NaN 3.0 3.0
43644 98843645 Female Loyal Customer 25.0 Personal Travel Eco 2109 1.0 2.0 1 1.0 Green Car 1.0 4.0 1.0 5.0 5.0 NaN NaN NaN NaN 1.0 2.0 4.0 5.0
43836 98843837 Female Loyal Customer 13.0 Personal Travel Eco 1880 0.0 0.0 1 4.0 Ordinary 4.0 4.0 4.0 2.0 5.0 NaN NaN 4.0 4.0 NaN NaN 4.0 4.0
44050 98844051 Male Loyal Customer 19.0 Personal Travel Eco 1410 0.0 0.0 1 4.0 Ordinary 2.0 1.0 4.0 5.0 1.0 NaN NaN NaN NaN 4.0 4.0 3.0 5.0
45055 98845056 Female Loyal Customer 30.0 NaN Business 4707 1.0 20.0 0 2.0 Ordinary 4.0 4.0 4.0 2.0 2.0 NaN NaN 3.0 1.0 NaN NaN 3.0 2.0
46703 98846704 Female Disloyal Customer 24.0 NaN Eco 2008 130.0 127.0 0 3.0 Green Car 2.0 3.0 3.0 3.0 3.0 NaN NaN 3.0 4.0 NaN NaN 3.0 3.0
47952 98847953 Female Disloyal Customer 50.0 Business Travel Eco 2062 0.0 0.0 0 3.0 Ordinary 0.0 3.0 4.0 4.0 3.0 NaN NaN NaN NaN 3.0 4.0 4.0 4.0
50305 98850306 Male Loyal Customer 23.0 Personal Travel Eco 2534 0.0 0.0 0 2.0 Green Car NaN 2.0 3.0 4.0 2.0 NaN NaN 3.0 4.0 NaN NaN 3.0 4.0
50701 98850702 Male Loyal Customer 63.0 Personal Travel Eco 2443 1.0 0.0 0 3.0 Green Car 5.0 4.0 5.0 3.0 4.0 NaN NaN 5.0 4.0 NaN NaN 4.0 3.0
51384 98851385 Female Loyal Customer 51.0 Business Travel Business 4037 8.0 0.0 1 1.0 Green Car 1.0 1.0 1.0 4.0 5.0 NaN NaN NaN NaN 4.0 3.0 4.0 4.0
51963 98851964 Male Loyal Customer 55.0 Business Travel Eco 2255 0.0 0.0 1 4.0 Ordinary NaN 5.0 5.0 4.0 4.0 NaN NaN 5.0 1.0 NaN NaN 4.0 4.0
53370 98853371 Male Loyal Customer 40.0 Personal Travel Eco 2251 0.0 8.0 0 1.0 Green Car 1.0 1.0 1.0 1.0 1.0 NaN NaN 3.0 3.0 NaN NaN 1.0 1.0
57283 98857284 Female Loyal Customer 41.0 Business Travel Business 1952 46.0 44.0 1 1.0 Green Car 1.0 1.0 1.0 5.0 4.0 NaN NaN NaN NaN 5.0 5.0 5.0 4.0
58878 98858879 Female Loyal Customer 52.0 Personal Travel Eco 1789 2.0 0.0 1 4.0 Ordinary 5.0 NaN 4.0 5.0 5.0 NaN NaN 2.0 5.0 NaN NaN 2.0 3.0
58982 98858983 Male Disloyal Customer 7.0 Business Travel Eco 2016 22.0 11.0 0 4.0 Green Car 2.0 4.0 4.0 5.0 4.0 NaN NaN 4.0 2.0 NaN NaN 4.0 5.0
59864 98859865 Male Loyal Customer 17.0 Business Travel Business 1242 0.0 0.0 1 3.0 Green Car 3.0 3.0 3.0 5.0 5.0 NaN NaN NaN NaN 5.0 2.0 2.0 5.0
59980 98859981 Male NaN 41.0 Business Travel Business 3804 0.0 0.0 1 4.0 Ordinary 4.0 4.0 4.0 1.0 1.0 NaN NaN NaN NaN 4.0 3.0 4.0 3.0
61021 98861022 Male Loyal Customer 39.0 Business Travel Business 1932 14.0 33.0 1 3.0 Green Car 3.0 NaN 3.0 5.0 5.0 NaN NaN 4.0 4.0 NaN NaN 4.0 5.0
61222 98861223 Female Loyal Customer 41.0 Business Travel Eco 622 6.0 2.0 0 2.0 Ordinary 1.0 2.0 1.0 4.0 2.0 NaN NaN NaN NaN 2.0 2.0 2.0 2.0
66977 98866978 Female Loyal Customer 53.0 Business Travel Business 922 6.0 5.0 1 3.0 Green Car 3.0 3.0 3.0 1.0 4.0 NaN NaN 4.0 4.0 NaN NaN 4.0 2.0
68315 98868316 Female Disloyal Customer 22.0 Business Travel Eco 2237 9.0 0.0 1 2.0 Ordinary 3.0 3.0 3.0 4.0 3.0 NaN NaN 3.0 3.0 NaN NaN 3.0 4.0
68326 98868327 Male Loyal Customer 60.0 Business Travel Eco 2304 42.0 39.0 0 1.0 Ordinary 3.0 3.0 3.0 1.0 1.0 NaN NaN 1.0 2.0 NaN NaN 4.0 1.0
68796 98868797 Female Loyal Customer 54.0 Personal Travel Eco 245 0.0 6.0 1 5.0 Green Car 5.0 5.0 5.0 5.0 5.0 NaN NaN 4.0 4.0 NaN NaN 4.0 5.0
70627 98870628 Female Loyal Customer 49.0 Business Travel Business 2522 0.0 0.0 1 5.0 Ordinary 5.0 NaN 5.0 5.0 4.0 NaN NaN 5.0 5.0 NaN NaN 5.0 3.0
72549 98872550 Female Loyal Customer 39.0 Business Travel Business 629 0.0 0.0 1 5.0 Green Car 5.0 5.0 5.0 5.0 5.0 NaN NaN 4.0 4.0 NaN NaN 4.0 5.0
73825 98873826 Male Loyal Customer 8.0 Business Travel Business 2988 2.0 0.0 1 3.0 Green Car 3.0 3.0 3.0 2.0 2.0 NaN NaN 5.0 5.0 NaN NaN 4.0 2.0
80131 98880132 Male Loyal Customer 52.0 Business Travel Business 3161 0.0 2.0 1 5.0 Green Car 5.0 5.0 5.0 3.0 4.0 NaN NaN 4.0 4.0 NaN NaN 4.0 5.0
81304 98881305 Male Loyal Customer 50.0 Business Travel Eco 1707 0.0 0.0 1 4.0 Green Car 2.0 2.0 2.0 4.0 4.0 NaN NaN 2.0 5.0 NaN NaN 1.0 4.0
81664 98881665 Male Loyal Customer 60.0 Business Travel Business 2212 0.0 0.0 1 1.0 Green Car 1.0 5.0 1.0 5.0 4.0 NaN NaN 4.0 4.0 NaN NaN 4.0 5.0
81800 98881801 Male Loyal Customer 58.0 Business Travel Business 2009 0.0 0.0 1 1.0 Ordinary 1.0 1.0 1.0 3.0 5.0 NaN NaN 4.0 4.0 NaN NaN 4.0 5.0
81996 98881997 Male Loyal Customer 34.0 Business Travel Business 3210 80.0 73.0 1 1.0 Ordinary 1.0 4.0 1.0 1.0 2.0 NaN NaN 5.0 5.0 NaN NaN 5.0 2.0
82744 98882745 Female NaN 18.0 Business Travel Eco 1625 0.0 0.0 1 5.0 Green Car 3.0 5.0 2.0 4.0 5.0 NaN NaN NaN NaN 4.0 4.0 5.0 4.0
83052 98883053 Female Loyal Customer 42.0 Personal Travel Eco 890 0.0 0.0 1 3.0 Green Car 3.0 3.0 3.0 4.0 4.0 NaN NaN 5.0 3.0 NaN NaN 4.0 4.0
83881 98883882 Male Loyal Customer 11.0 Personal Travel Eco 2883 0.0 0.0 0 2.0 Ordinary 4.0 2.0 1.0 1.0 2.0 NaN NaN 5.0 3.0 NaN NaN 4.0 1.0
84434 98884435 Male Loyal Customer 40.0 Business Travel Business 479 50.0 49.0 0 1.0 Green Car 3.0 3.0 3.0 2.0 3.0 NaN NaN NaN NaN 1.0 3.0 1.0 3.0
84798 98884799 Female Loyal Customer 57.0 Business Travel Business 3931 58.0 64.0 1 5.0 Ordinary 5.0 NaN 5.0 5.0 4.0 NaN NaN 4.0 4.0 NaN NaN 4.0 5.0
85216 98885217 Female Loyal Customer 25.0 Business Travel Business 2894 0.0 0.0 1 2.0 Green Car 2.0 2.0 2.0 4.0 4.0 NaN NaN NaN NaN 3.0 4.0 4.0 4.0
86180 98886181 Male Loyal Customer 44.0 Business Travel Eco 1574 0.0 0.0 1 5.0 Green Car NaN 2.0 2.0 5.0 5.0 NaN NaN 1.0 5.0 NaN NaN 1.0 5.0
87129 98887130 Male Loyal Customer 34.0 Business Travel Business 3925 5.0 33.0 1 5.0 Ordinary 5.0 5.0 5.0 5.0 3.0 NaN NaN NaN 5.0 NaN NaN 5.0 1.0
87755 98887756 Female Loyal Customer 39.0 Personal Travel Eco 1630 34.0 24.0 1 5.0 Ordinary 1.0 1.0 1.0 5.0 5.0 NaN NaN NaN NaN 1.0 5.0 4.0 5.0
89951 98889952 Female Loyal Customer 37.0 Business Travel Business 2810 3.0 1.0 1 4.0 Ordinary 3.0 4.0 4.0 2.0 3.0 NaN NaN NaN NaN 4.0 3.0 4.0 4.0
90252 98890253 Male Loyal Customer 50.0 Personal Travel Eco 2132 7.0 39.0 0 3.0 Green Car 4.0 3.0 3.0 5.0 3.0 NaN NaN NaN NaN 4.0 3.0 5.0 5.0
91805 98891806 Female Loyal Customer 56.0 Personal Travel Business 984 0.0 0.0 1 0.0 Ordinary 3.0 1.0 3.0 1.0 1.0 NaN NaN NaN NaN 5.0 5.0 1.0 2.0

Legroom

In [ ]:
labeled_countplot(train, 'Legroom', perc=True, order=True)
Number of null values:  90
In [ ]:
# Show every row where the Legroom rating is missing
train.loc[train['Legroom'].isnull()]
Out[ ]:
ID Gender Customer_Type Age Type_Travel Travel_Class Travel_Distance Departure_Delay_in_Mins Arrival_Delay_in_Mins Overall_Experience Seat_Comfort Seat_Class Arrival_Time_Convenient Catering Platform_Location Onboard_Wifi_Service Onboard_Entertainment Online_Support Ease_of_Online_Booking Onboard_Service Legroom Baggage_Handling CheckIn_Service Cleanliness Online_Boarding
1708 98801709 Male Loyal Customer 15.0 Personal Travel Eco 3443 27.0 37.0 0 3.0 Green Car 5.0 3.0 2.0 3.0 3.0 NaN NaN NaN NaN 5.0 3.0 3.0 3.0
2214 98802215 Male Loyal Customer 30.0 NaN Business 4725 86.0 77.0 1 NaN Green Car NaN 1.0 3.0 4.0 4.0 4.0 4.0 NaN NaN NaN 4.0 5.0 4.0
2666 98802667 Female Loyal Customer 39.0 Business Travel Business 1995 43.0 30.0 1 2.0 Green Car 2.0 5.0 2.0 5.0 5.0 4.0 5.0 NaN NaN NaN 5.0 5.0 3.0
3171 98803172 Female Loyal Customer 50.0 Business Travel Business 3843 0.0 0.0 1 5.0 Green Car 5.0 1.0 5.0 2.0 4.0 NaN NaN NaN NaN 4.0 5.0 4.0 5.0
4754 98804755 Male Loyal Customer 63.0 Personal Travel Eco 1380 0.0 0.0 0 NaN Ordinary NaN 2.0 3.0 5.0 2.0 1.0 5.0 NaN NaN NaN 5.0 4.0 5.0
5551 98805552 Female Disloyal Customer 24.0 Business Travel Business 2093 14.0 3.0 1 5.0 Ordinary 0.0 5.0 4.0 1.0 5.0 1.0 1.0 NaN NaN NaN 3.0 4.0 1.0
7019 98807020 Male Loyal Customer 68.0 Business Travel Business 2216 9.0 8.0 1 2.0 Green Car 2.0 2.0 2.0 4.0 5.0 5.0 4.0 NaN NaN NaN 3.0 4.0 4.0
8490 98808491 Female Loyal Customer 39.0 Business Travel Eco 1814 0.0 0.0 1 5.0 Green Car 4.0 4.0 4.0 5.0 5.0 5.0 5.0 NaN NaN NaN 1.0 5.0 5.0
8904 98808905 Male Loyal Customer 57.0 Business Travel Business 601 4.0 0.0 1 5.0 Ordinary 5.0 2.0 5.0 4.0 5.0 NaN NaN NaN NaN 5.0 4.0 5.0 5.0
8936 98808937 Female Loyal Customer 56.0 Business Travel Eco 273 0.0 0.0 1 NaN Ordinary NaN 2.0 2.0 1.0 1.0 3.0 3.0 NaN NaN NaN 4.0 3.0 4.0
10241 98810242 Female Disloyal Customer 14.0 Personal Travel Business 1966 0.0 0.0 0 3.0 Ordinary 5.0 3.0 4.0 1.0 3.0 1.0 1.0 NaN NaN NaN 4.0 4.0 1.0
11079 98811080 Female Loyal Customer 70.0 Personal Travel Eco 912 0.0 0.0 1 3.0 Green Car 3.0 3.0 3.0 4.0 4.0 4.0 4.0 NaN NaN NaN 4.0 4.0 4.0
11199 98811200 Male Disloyal Customer 24.0 Business Travel Eco 1732 39.0 6.0 0 3.0 Ordinary 3.0 3.0 3.0 5.0 3.0 5.0 5.0 NaN NaN NaN 1.0 3.0 5.0
12758 98812759 Male Loyal Customer 61.0 Personal Travel Eco 1822 0.0 0.0 1 0.0 Ordinary 5.0 0.0 3.0 3.0 0.0 3.0 3.0 NaN NaN NaN 5.0 3.0 3.0
18978 98818979 Male NaN 37.0 Business Travel Eco 1575 0.0 0.0 0 1.0 Ordinary 3.0 3.0 3.0 1.0 1.0 1.0 1.0 NaN NaN NaN 4.0 3.0 1.0
19290 98819291 Male Loyal Customer 47.0 Business Travel Business 2007 42.0 31.0 1 1.0 Ordinary 1.0 1.0 1.0 5.0 5.0 4.0 2.0 NaN NaN NaN 4.0 2.0 3.0
20440 98820441 Female NaN 56.0 Business Travel Business 3676 0.0 0.0 1 2.0 Green Car 2.0 2.0 2.0 2.0 4.0 4.0 4.0 NaN NaN NaN 5.0 4.0 5.0
20664 98820665 Female Loyal Customer 57.0 Business Travel Business 3459 0.0 0.0 1 4.0 Green Car 4.0 4.0 4.0 3.0 5.0 4.0 5.0 NaN NaN NaN 5.0 5.0 5.0
22362 98822363 Male Loyal Customer 63.0 Personal Travel Eco 1209 0.0 0.0 0 3.0 Green Car 5.0 3.0 2.0 2.0 3.0 2.0 2.0 NaN NaN NaN 3.0 4.0 2.0
22723 98822724 Male Loyal Customer 44.0 Business Travel Business 2101 0.0 6.0 1 3.0 Green Car 3.0 3.0 3.0 5.0 4.0 NaN NaN NaN NaN 2.0 5.0 2.0 5.0
26364 98826365 Male Loyal Customer 46.0 Business Travel Eco 2195 0.0 21.0 0 3.0 Green Car 4.0 4.0 4.0 3.0 3.0 3.0 3.0 NaN NaN NaN 4.0 4.0 3.0
26664 98826665 Male Loyal Customer 43.0 Business Travel Business 1022 0.0 29.0 1 1.0 Ordinary 3.0 1.0 1.0 4.0 4.0 NaN NaN NaN NaN 4.0 4.0 4.0 4.0
27784 98827785 Female Disloyal Customer 25.0 Business Travel Business 976 4.0 0.0 0 4.0 Green Car 0.0 4.0 2.0 3.0 4.0 NaN NaN NaN NaN 5.0 3.0 4.0 3.0
28858 98828859 Male Disloyal Customer 26.0 Business Travel Business 1112 0.0 0.0 0 NaN Green Car NaN 0.0 3.0 4.0 0.0 4.0 4.0 NaN NaN NaN 3.0 4.0 4.0
30082 98830083 Female Disloyal Customer 24.0 Business Travel Eco 1975 3.0 0.0 1 5.0 Ordinary 0.0 NaN 3.0 3.0 0.0 NaN NaN NaN NaN 5.0 4.0 5.0 3.0
30426 98830427 Female Loyal Customer 52.0 Business Travel Business 2423 0.0 0.0 1 5.0 Ordinary 5.0 5.0 5.0 2.0 5.0 NaN NaN NaN NaN 5.0 5.0 5.0 4.0
31097 98831098 Male Loyal Customer 45.0 Business Travel Eco 1639 0.0 0.0 1 4.0 Green Car 4.0 4.0 4.0 4.0 4.0 4.0 4.0 NaN NaN NaN 2.0 5.0 4.0
31371 98831372 Female Loyal Customer 21.0 Personal Travel Eco 2915 23.0 0.0 1 4.0 Green Car NaN 4.0 4.0 2.0 2.0 2.0 4.0 NaN NaN NaN 1.0 4.0 1.0
33114 98833115 Male Disloyal Customer 30.0 Business Travel Eco 2180 0.0 0.0 0 NaN Ordinary NaN 2.0 4.0 1.0 2.0 1.0 1.0 NaN NaN NaN 5.0 5.0 1.0
33511 98833512 Female NaN 37.0 Personal Travel Eco 2617 5.0 3.0 1 4.0 Ordinary 4.0 2.0 4.0 2.0 3.0 2.0 4.0 NaN NaN NaN 1.0 4.0 4.0
34589 98834590 Male Loyal Customer 26.0 Personal Travel Eco 2001 0.0 0.0 0 4.0 Green Car 3.0 4.0 3.0 5.0 4.0 5.0 5.0 NaN NaN NaN 3.0 1.0 5.0
36254 98836255 Male Disloyal Customer 32.0 Business Travel Eco 1706 0.0 2.0 0 NaN Ordinary NaN 3.0 2.0 1.0 3.0 1.0 1.0 NaN NaN NaN 5.0 1.0 1.0
38728 98838729 Female NaN 36.0 Business Travel Business 332 60.0 58.0 1 2.0 Ordinary 2.0 2.0 2.0 4.0 3.0 NaN NaN NaN NaN 4.0 1.0 4.0 1.0
38731 98838732 Male NaN 35.0 Business Travel Business 2403 0.0 0.0 1 5.0 Ordinary NaN 5.0 5.0 4.0 4.0 1.0 2.0 NaN NaN NaN 4.0 2.0 2.0
39389 98839390 Female Loyal Customer 39.0 NaN Business 375 10.0 12.0 1 3.0 Green Car NaN 3.0 3.0 2.0 4.0 5.0 4.0 NaN NaN NaN 4.0 4.0 5.0
39711 98839712 Male Loyal Customer 43.0 Business Travel Business 3936 0.0 0.0 1 4.0 Ordinary 4.0 4.0 4.0 2.0 4.0 NaN NaN NaN NaN 5.0 3.0 5.0 5.0
39716 98839717 Female Loyal Customer 47.0 NaN Eco 401 9.0 46.0 1 5.0 Green Car 4.0 NaN 4.0 5.0 5.0 5.0 5.0 NaN NaN NaN 1.0 2.0 5.0
40634 98840635 Female Loyal Customer 27.0 Personal Travel Eco 2750 14.0 24.0 1 4.0 Ordinary NaN 3.0 3.0 4.0 4.0 4.0 4.0 NaN NaN NaN 5.0 2.0 4.0
40659 98840660 Male NaN 31.0 Business Travel Business 2139 19.0 8.0 1 2.0 Green Car 2.0 2.0 2.0 4.0 5.0 4.0 4.0 NaN NaN NaN 5.0 4.0 4.0
41681 98841682 Female Loyal Customer 43.0 Business Travel Eco 1066 0.0 0.0 1 5.0 Green Car 2.0 2.0 2.0 4.0 5.0 5.0 5.0 NaN NaN NaN 4.0 5.0 4.0
43644 98843645 Female Loyal Customer 25.0 Personal Travel Eco 2109 1.0 2.0 1 1.0 Green Car 1.0 4.0 1.0 5.0 5.0 NaN NaN NaN NaN 1.0 2.0 4.0 5.0
43715 98843716 Male Disloyal Customer 22.0 Business Travel Eco 2770 0.0 0.0 1 2.0 Ordinary 2.0 2.0 3.0 4.0 2.0 4.0 4.0 NaN NaN NaN 1.0 5.0 4.0
44050 98844051 Male Loyal Customer 19.0 Personal Travel Eco 1410 0.0 0.0 1 4.0 Ordinary 2.0 1.0 4.0 5.0 1.0 NaN NaN NaN NaN 4.0 4.0 3.0 5.0
44190 98844191 Male Loyal Customer 31.0 NaN Business 4464 3.0 0.0 1 3.0 Ordinary 3.0 NaN 3.0 5.0 5.0 5.0 5.0 NaN NaN NaN 5.0 5.0 5.0
46420 98846421 Female NaN 11.0 Personal Travel Eco 1763 0.0 10.0 1 5.0 Green Car 5.0 5.0 5.0 4.0 4.0 5.0 5.0 NaN NaN NaN 5.0 5.0 3.0
46495 98846496 Male Loyal Customer 44.0 Personal Travel Eco 1387 6.0 83.0 0 2.0 Ordinary 0.0 NaN 3.0 5.0 0.0 3.0 5.0 NaN NaN NaN 4.0 4.0 5.0
46912 98846913 Male Loyal Customer 47.0 NaN Eco 1598 0.0 0.0 0 4.0 Green Car NaN 1.0 1.0 4.0 4.0 4.0 4.0 NaN NaN NaN 3.0 3.0 4.0
47952 98847953 Female Disloyal Customer 50.0 Business Travel Eco 2062 0.0 0.0 0 3.0 Ordinary 0.0 3.0 4.0 4.0 3.0 NaN NaN NaN NaN 3.0 4.0 4.0 4.0
48013 98848014 Female Loyal Customer 38.0 NaN Business 2371 0.0 0.0 0 NaN Ordinary NaN 4.0 4.0 3.0 4.0 4.0 3.0 NaN NaN NaN 4.0 3.0 2.0
48208 98848209 Male Loyal Customer 55.0 Business Travel Business 2106 32.0 7.0 1 1.0 Green Car 1.0 NaN 1.0 4.0 5.0 5.0 4.0 NaN NaN NaN 3.0 4.0 5.0
49077 98849078 Female Disloyal Customer 44.0 Business Travel Eco 1573 105.0 85.0 0 3.0 Ordinary 4.0 4.0 4.0 3.0 4.0 4.0 3.0 NaN NaN NaN 2.0 3.0 3.0
50462 98850463 Male Loyal Customer 52.0 Business Travel Business 391 14.0 15.0 1 3.0 Ordinary 3.0 3.0 3.0 4.0 4.0 5.0 4.0 NaN NaN NaN 4.0 4.0 5.0
51384 98851385 Female Loyal Customer 51.0 Business Travel Business 4037 8.0 0.0 1 1.0 Green Car 1.0 1.0 1.0 4.0 5.0 NaN NaN NaN NaN 4.0 3.0 4.0 4.0
52133 98852134 Female Loyal Customer 34.0 Personal Travel Eco 2022 16.0 11.0 1 2.0 Green Car NaN 2.0 2.0 4.0 5.0 4.0 5.0 NaN NaN NaN 3.0 5.0 3.0
56493 98856494 Male NaN 52.0 Business Travel Business 2289 0.0 8.0 1 0.0 Ordinary NaN 0.0 2.0 4.0 4.0 5.0 1.0 NaN NaN NaN 4.0 1.0 5.0
57283 98857284 Female Loyal Customer 41.0 Business Travel Business 1952 46.0 44.0 1 1.0 Green Car 1.0 1.0 1.0 5.0 4.0 NaN NaN NaN NaN 5.0 5.0 5.0 4.0
58502 98858503 Male Loyal Customer 16.0 Business Travel Eco 1966 0.0 0.0 0 1.0 Green Car 3.0 3.0 3.0 1.0 1.0 1.0 1.0 NaN NaN NaN 4.0 4.0 1.0
59864 98859865 Male Loyal Customer 17.0 Business Travel Business 1242 0.0 0.0 1 3.0 Green Car 3.0 3.0 3.0 5.0 5.0 NaN NaN NaN NaN 5.0 2.0 2.0 5.0
59980 98859981 Male NaN 41.0 Business Travel Business 3804 0.0 0.0 1 4.0 Ordinary 4.0 4.0 4.0 1.0 1.0 NaN NaN NaN NaN 4.0 3.0 4.0 3.0
60804 98860805 Male Loyal Customer 63.0 NaN Eco 2081 42.0 34.0 0 2.0 Green Car 4.0 2.0 4.0 1.0 2.0 1.0 1.0 NaN NaN NaN 3.0 5.0 1.0
61222 98861223 Female Loyal Customer 41.0 Business Travel Eco 622 6.0 2.0 0 2.0 Ordinary 1.0 2.0 1.0 4.0 2.0 NaN NaN NaN NaN 2.0 2.0 2.0 2.0
62337 98862338 Male Loyal Customer 50.0 NaN Business 2681 9.0 0.0 0 2.0 Ordinary 5.0 5.0 5.0 2.0 4.0 3.0 2.0 NaN NaN NaN 3.0 2.0 3.0
64619 98864620 Female Loyal Customer 52.0 Personal Travel Eco 125 127.0 123.0 1 5.0 Green Car 3.0 5.0 3.0 2.0 5.0 5.0 2.0 NaN NaN NaN 1.0 3.0 2.0
65793 98865794 Male Loyal Customer 48.0 Business Travel Eco 2701 44.0 75.0 0 1.0 Green Car 4.0 4.0 4.0 1.0 1.0 1.0 1.0 NaN NaN NaN 1.0 4.0 1.0
65831 98865832 Female Loyal Customer 54.0 Business Travel Business 3528 10.0 8.0 1 4.0 Ordinary 4.0 4.0 4.0 4.0 5.0 5.0 5.0 NaN NaN NaN 4.0 5.0 5.0
72379 98872380 Female Loyal Customer 48.0 Personal Travel Eco 584 0.0 0.0 1 5.0 Ordinary 5.0 5.0 5.0 5.0 4.0 4.0 4.0 NaN NaN NaN 4.0 4.0 3.0
74459 98874460 Male Loyal Customer 45.0 Business Travel Business 483 4.0 4.0 1 3.0 Green Car 3.0 3.0 3.0 3.0 4.0 5.0 5.0 NaN NaN NaN 4.0 5.0 5.0
77629 98877630 Male Loyal Customer 54.0 Business Travel Business 1968 0.0 0.0 1 1.0 Green Car NaN 3.0 1.0 5.0 4.0 4.0 5.0 NaN NaN NaN 4.0 5.0 3.0
79166 98879167 Male Loyal Customer 26.0 NaN Eco 3954 0.0 0.0 0 3.0 Green Car 2.0 3.0 1.0 3.0 5.0 5.0 1.0 NaN NaN NaN 5.0 2.0 5.0
80665 98880666 Male Loyal Customer 25.0 NaN Eco 1560 1.0 0.0 0 2.0 Ordinary NaN 2.0 2.0 4.0 2.0 4.0 4.0 NaN NaN NaN 2.0 1.0 4.0
80709 98880710 Male Disloyal Customer 36.0 NaN Eco 1964 104.0 100.0 0 4.0 Ordinary 2.0 4.0 4.0 5.0 4.0 4.0 5.0 NaN NaN NaN 4.0 3.0 5.0
81399 98881400 Female Loyal Customer 9.0 Business Travel Eco 2278 60.0 53.0 0 NaN Ordinary NaN 4.0 1.0 3.0 3.0 4.0 3.0 NaN NaN NaN 1.0 4.0 3.0
81504 98881505 Female Loyal Customer 47.0 Personal Travel Eco 1076 34.0 65.0 1 NaN Ordinary NaN 0.0 4.0 5.0 5.0 4.0 2.0 NaN NaN NaN 5.0 2.0 3.0
81876 98881877 Female Loyal Customer 33.0 Business Travel Eco 1713 19.0 23.0 1 NaN Green Car NaN 5.0 5.0 5.0 5.0 5.0 5.0 NaN NaN NaN 4.0 1.0 5.0
82501 98882502 Male NaN 49.0 Personal Travel Eco 1374 17.0 11.0 0 2.0 Green Car 5.0 3.0 2.0 5.0 3.0 5.0 5.0 NaN NaN NaN 4.0 3.0 5.0
82744 98882745 Female NaN 18.0 Business Travel Eco 1625 0.0 0.0 1 5.0 Green Car 3.0 5.0 2.0 4.0 5.0 NaN NaN NaN NaN 4.0 4.0 5.0 4.0
83934 98883935 Female Disloyal Customer 36.0 Business Travel Business 1917 8.0 0.0 0 4.0 Green Car 4.0 4.0 5.0 2.0 4.0 2.0 2.0 NaN NaN NaN 4.0 4.0 2.0
84434 98884435 Male Loyal Customer 40.0 Business Travel Business 479 50.0 49.0 0 1.0 Green Car 3.0 3.0 3.0 2.0 3.0 NaN NaN NaN NaN 1.0 3.0 1.0 3.0
84551 98884552 Male Loyal Customer 60.0 Business Travel Eco 2085 3.0 0.0 0 NaN Ordinary NaN 1.0 1.0 2.0 2.0 2.0 2.0 NaN NaN NaN 3.0 3.0 2.0
84876 98884877 Female Disloyal Customer 20.0 NaN Eco 1994 1.0 0.0 0 3.0 Ordinary NaN 3.0 4.0 3.0 3.0 2.0 3.0 NaN NaN NaN 2.0 3.0 3.0
85216 98885217 Female Loyal Customer 25.0 Business Travel Business 2894 0.0 0.0 1 2.0 Green Car 2.0 2.0 2.0 4.0 4.0 NaN NaN NaN NaN 3.0 4.0 4.0 4.0
85713 98885714 Female Disloyal Customer 26.0 Business Travel Eco 2061 4.0 34.0 0 NaN Green Car NaN 2.0 4.0 2.0 2.0 2.0 2.0 NaN NaN NaN 3.0 4.0 2.0
86991 98886992 Female Loyal Customer 47.0 NaN Business 3234 0.0 39.0 0 3.0 Ordinary 2.0 2.0 2.0 1.0 3.0 4.0 3.0 NaN NaN NaN 1.0 3.0 2.0
87693 98887694 Female Loyal Customer 40.0 Business Travel Eco 2361 0.0 0.0 0 NaN Green Car NaN 5.0 5.0 2.0 2.0 2.0 2.0 NaN NaN NaN 3.0 3.0 2.0
87755 98887756 Female Loyal Customer 39.0 Personal Travel Eco 1630 34.0 24.0 1 5.0 Ordinary 1.0 1.0 1.0 5.0 5.0 NaN NaN NaN NaN 1.0 5.0 4.0 5.0
88765 98888766 Female Disloyal Customer 49.0 Business Travel Eco 1788 22.0 18.0 0 NaN Green Car NaN 3.0 4.0 2.0 3.0 1.0 2.0 NaN NaN NaN 2.0 3.0 2.0
88852 98888853 Female NaN 66.0 Personal Travel Eco 696 10.0 0.0 1 5.0 Green Car 5.0 5.0 5.0 2.0 5.0 4.0 3.0 NaN NaN NaN 3.0 3.0 4.0
89951 98889952 Female Loyal Customer 37.0 Business Travel Business 2810 3.0 1.0 1 4.0 Ordinary 3.0 4.0 4.0 2.0 3.0 NaN NaN NaN NaN 4.0 3.0 4.0 4.0
90252 98890253 Male Loyal Customer 50.0 Personal Travel Eco 2132 7.0 39.0 0 3.0 Green Car 4.0 3.0 3.0 5.0 3.0 NaN NaN NaN NaN 4.0 3.0 5.0 5.0
91805 98891806 Female Loyal Customer 56.0 Personal Travel Business 984 0.0 0.0 1 0.0 Ordinary 3.0 1.0 3.0 1.0 1.0 NaN NaN NaN NaN 5.0 5.0 1.0 2.0
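
In the rows listed above, a missing Legroom rating appears to coincide with missing values in neighbouring survey columns. One way to quantify that, rather than reading the table by eye, is to count the nulls in every other column for the rows where Legroom is null. The snippet below is a minimal sketch of that check: `co_missing_counts` is a hypothetical helper (not part of the notebook), and `toy` is a small made-up frame standing in for `train`, so the numbers it prints say nothing about the real data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `train`; the values are invented for illustration only.
toy = pd.DataFrame({
    "Onboard_Service": [np.nan, 4.0, np.nan, 3.0, 5.0],
    "Legroom": [np.nan, np.nan, np.nan, 2.0, 4.0],
    "Baggage_Handling": [np.nan, 5.0, np.nan, np.nan, 3.0],
})


def co_missing_counts(df, target):
    """For rows where `target` is null, count the nulls in every other column."""
    rows = df.loc[df[target].isna()]
    return rows.drop(columns=target).isna().sum().sort_values(ascending=False)


# Hypothetical usage; on the real data this would be co_missing_counts(train, "Legroom").
legroom_co_missing = co_missing_counts(toy, "Legroom")
print(legroom_co_missing)
```

Columns whose count is close to the number of rows with a null in `target` are candidates for being imputed together, or for a shared "not answered" indicator, instead of being treated independently.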

Baggage Handling

In [ ]:
labeled_countplot(train, 'Baggage_Handling', perc=True, order=True)
Number of null values:  142
In [ ]:
# Keep the rows with a missing Baggage_Handling rating for later inspection
missing_bh = train.loc[train['Baggage_Handling'].isnull()]
missing_bh
Out[ ]:
ID Gender Customer_Type Age Type_Travel Travel_Class Travel_Distance Departure_Delay_in_Mins Arrival_Delay_in_Mins Overall_Experience Seat_Comfort Seat_Class Arrival_Time_Convenient Catering Platform_Location Onboard_Wifi_Service Onboard_Entertainment Online_Support Ease_of_Online_Booking Onboard_Service Legroom Baggage_Handling CheckIn_Service Cleanliness Online_Boarding
13 98800014 Female Loyal Customer 47.0 Personal Travel Eco 1100 20.0 34.0 0 4.0 Ordinary 4.0 4.0 3.0 4.0 5.0 NaN NaN 3.0 4.0 NaN NaN 3.0 4.0
2214 98802215 Male Loyal Customer 30.0 NaN Business 4725 86.0 77.0 1 NaN Green Car NaN 1.0 3.0 4.0 4.0 4.0 4.0 NaN NaN NaN 4.0 5.0 4.0
2666 98802667 Female Loyal Customer 39.0 Business Travel Business 1995 43.0 30.0 1 2.0 Green Car 2.0 5.0 2.0 5.0 5.0 4.0 5.0 NaN NaN NaN 5.0 5.0 3.0
2821 98802822 Male Loyal Customer 30.0 NaN Eco 1018 21.0 11.0 1 5.0 Green Car 1.0 3.0 3.0 5.0 5.0 NaN NaN 2.0 3.0 NaN NaN 5.0 5.0
3249 98803250 Male Loyal Customer 43.0 Business Travel Business 807 27.0 13.0 0 3.0 Ordinary NaN 1.0 1.0 3.0 3.0 NaN NaN 3.0 3.0 NaN NaN 3.0 4.0
4754 98804755 Male Loyal Customer 63.0 Personal Travel Eco 1380 0.0 0.0 0 NaN Ordinary NaN 2.0 3.0 5.0 2.0 1.0 5.0 NaN NaN NaN 5.0 4.0 5.0
5551 98805552 Female Disloyal Customer 24.0 Business Travel Business 2093 14.0 3.0 1 5.0 Ordinary 0.0 5.0 4.0 1.0 5.0 1.0 1.0 NaN NaN NaN 3.0 4.0 1.0
5912 98805913 Male NaN 60.0 Business Travel Business 2931 0.0 7.0 0 4.0 Ordinary NaN 3.0 3.0 2.0 3.0 NaN NaN 4.0 4.0 NaN NaN 4.0 3.0
7019 98807020 Male Loyal Customer 68.0 Business Travel Business 2216 9.0 8.0 1 2.0 Green Car 2.0 2.0 2.0 4.0 5.0 5.0 4.0 NaN NaN NaN 3.0 4.0 4.0
8490 98808491 Female Loyal Customer 39.0 Business Travel Eco 1814 0.0 0.0 1 5.0 Green Car 4.0 4.0 4.0 5.0 5.0 5.0 5.0 NaN NaN NaN 1.0 5.0 5.0
8936 98808937 Female Loyal Customer 56.0 Business Travel Eco 273 0.0 0.0 1 NaN Ordinary NaN 2.0 2.0 1.0 1.0 3.0 3.0 NaN NaN NaN 4.0 3.0 4.0
10241 98810242 Female Disloyal Customer 14.0 Personal Travel Business 1966 0.0 0.0 0 3.0 Ordinary 5.0 3.0 4.0 1.0 3.0 1.0 1.0 NaN NaN NaN 4.0 4.0 1.0
10656 98810657 Male Loyal Customer 10.0 Personal Travel Eco 2163 0.0 0.0 0 1.0 Ordinary 5.0 NaN 2.0 3.0 1.0 NaN NaN 5.0 3.0 NaN NaN 5.0 3.0
11079 98811080 Female Loyal Customer 70.0 Personal Travel Eco 912 0.0 0.0 1 3.0 Green Car 3.0 3.0 3.0 4.0 4.0 4.0 4.0 NaN NaN NaN 4.0 4.0 4.0
11134 98811135 Male Loyal Customer 68.0 Personal Travel Eco 1568 87.0 94.0 0 2.0 Green Car 2.0 NaN 3.0 1.0 0.0 1.0 1.0 1.0 3.0 NaN NaN 4.0 1.0
11199 98811200 Male Disloyal Customer 24.0 Business Travel Eco 1732 39.0 6.0 0 3.0 Ordinary 3.0 3.0 3.0 5.0 3.0 5.0 5.0 NaN NaN NaN 1.0 3.0 5.0
11203 98811204 Male Loyal Customer 37.0 Business Travel Business 2637 0.0 0.0 1 1.0 Ordinary 1.0 1.0 1.0 5.0 3.0 5.0 5.0 5.0 4.0 NaN NaN 5.0 5.0
12755 98812756 Female Loyal Customer 47.0 NaN Eco 3113 1592.0 1584.0 0 2.0 Ordinary 2.0 2.0 3.0 2.0 4.0 NaN NaN 4.0 4.0 NaN NaN 3.0 2.0
12758 98812759 Male Loyal Customer 61.0 Personal Travel Eco 1822 0.0 0.0 1 0.0 Ordinary 5.0 0.0 3.0 3.0 0.0 3.0 3.0 NaN NaN NaN 5.0 3.0 3.0
15119 98815120 Female Loyal Customer 53.0 NaN Business 1646 0.0 0.0 1 5.0 Green Car 5.0 NaN 5.0 3.0 3.0 NaN NaN 4.0 4.0 NaN NaN 4.0 5.0
16347 98816348 Male Disloyal Customer 43.0 Business Travel Business 1491 53.0 53.0 0 3.0 Green Car 2.0 2.0 3.0 4.0 3.0 NaN NaN 1.0 5.0 NaN NaN 4.0 4.0
17364 98817365 Female Loyal Customer 63.0 Personal Travel Business 442 0.0 0.0 1 0.0 Green Car 0.0 NaN 3.0 3.0 4.0 NaN NaN 5.0 5.0 NaN NaN 5.0 4.0
18741 98818742 Male Loyal Customer 25.0 Business Travel Business 4957 5.0 0.0 1 1.0 Green Car 1.0 3.0 1.0 4.0 4.0 4.0 4.0 5.0 2.0 NaN NaN 5.0 4.0
18978 98818979 Male NaN 37.0 Business Travel Eco 1575 0.0 0.0 0 1.0 Ordinary 3.0 3.0 3.0 1.0 1.0 1.0 1.0 NaN NaN NaN 4.0 3.0 1.0
19290 98819291 Male Loyal Customer 47.0 Business Travel Business 2007 42.0 31.0 1 1.0 Ordinary 1.0 1.0 1.0 5.0 5.0 4.0 2.0 NaN NaN NaN 4.0 2.0 3.0
20440 98820441 Female NaN 56.0 Business Travel Business 3676 0.0 0.0 1 2.0 Green Car 2.0 2.0 2.0 2.0 4.0 4.0 4.0 NaN NaN NaN 5.0 4.0 5.0
20664 98820665 Female Loyal Customer 57.0 Business Travel Business 3459 0.0 0.0 1 4.0 Green Car 4.0 4.0 4.0 3.0 5.0 4.0 5.0 NaN NaN NaN 5.0 5.0 5.0
22362 98822363 Male Loyal Customer 63.0 Personal Travel Eco 1209 0.0 0.0 0 3.0 Green Car 5.0 3.0 2.0 2.0 3.0 2.0 2.0 NaN NaN NaN 3.0 4.0 2.0
22547 98822548 Female Disloyal Customer 50.0 NaN Business 1802 0.0 0.0 0 4.0 Ordinary 4.0 NaN 2.0 3.0 4.0 3.0 3.0 3.0 4.0 NaN NaN 4.0 3.0
23692 98823693 Female Loyal Customer 31.0 Business Travel Business 1945 0.0 0.0 0 1.0 Ordinary NaN 4.0 4.0 1.0 1.0 1.0 1.0 1.0 1.0 NaN NaN 3.0 1.0
23718 98823719 Male NaN 65.0 Personal Travel Eco 2320 0.0 0.0 0 3.0 Green Car 3.0 3.0 4.0 4.0 3.0 1.0 4.0 3.0 1.0 NaN NaN 3.0 4.0
24017 98824018 Male Loyal Customer 28.0 Business Travel Business 4115 29.0 0.0 1 5.0 Ordinary NaN 5.0 5.0 2.0 2.0 NaN NaN 5.0 3.0 NaN NaN 3.0 2.0
24928 98824929 Male Loyal Customer 55.0 NaN Business 2404 0.0 0.0 0 4.0 Green Car 1.0 1.0 1.0 2.0 4.0 NaN NaN NaN 4.0 NaN NaN 4.0 4.0
26364 98826365 Male Loyal Customer 46.0 Business Travel Eco 2195 0.0 21.0 0 3.0 Green Car 4.0 4.0 4.0 3.0 3.0 3.0 3.0 NaN NaN NaN 4.0 4.0 3.0
28146 98828147 Female Loyal Customer 37.0 Personal Travel Eco 2395 0.0 0.0 1 2.0 Ordinary 2.0 2.0 2.0 2.0 4.0 NaN NaN 4.0 4.0 NaN NaN 4.0 5.0
28858 98828859 Male Disloyal Customer 26.0 Business Travel Business 1112 0.0 0.0 0 NaN Green Car NaN 0.0 3.0 4.0 0.0 4.0 4.0 NaN NaN NaN 3.0 4.0 4.0
30709 98830710 Female Loyal Customer 65.0 Personal Travel Eco 617 3.0 24.0 0 3.0 Ordinary 5.0 3.0 5.0 3.0 4.0 5.0 1.0 NaN 3.0 NaN NaN 1.0 4.0
31097 98831098 Male Loyal Customer 45.0 Business Travel Eco 1639 0.0 0.0 1 4.0 Green Car 4.0 4.0 4.0 4.0 4.0 4.0 4.0 NaN NaN NaN 2.0 5.0 4.0
31371 98831372 Female Loyal Customer 21.0 Personal Travel Eco 2915 23.0 0.0 1 4.0 Green Car NaN 4.0 4.0 2.0 2.0 2.0 4.0 NaN NaN NaN 1.0 4.0 1.0
32086 98832087 Male Loyal Customer 39.0 Business Travel Business 302 87.0 90.0 1 4.0 Ordinary 4.0 4.0 NaN NaN 5.0 4.0 4.0 4.0 4.0 NaN NaN 4.0 4.0
32913 98832914 Female Loyal Customer 53.0 Business Travel Business 1566 16.0 0.0 1 1.0 Ordinary 1.0 1.0 1.0 4.0 5.0 NaN NaN 4.0 4.0 NaN NaN 4.0 5.0
33114 98833115 Male Disloyal Customer 30.0 Business Travel Eco 2180 0.0 0.0 0 NaN Ordinary NaN 2.0 4.0 1.0 2.0 1.0 1.0 NaN NaN NaN 5.0 5.0 1.0
33511 98833512 Female NaN 37.0 Personal Travel Eco 2617 5.0 3.0 1 4.0 Ordinary 4.0 2.0 4.0 2.0 3.0 2.0 4.0 NaN NaN NaN 1.0 4.0 4.0
33809 98833810 Male Loyal Customer 47.0 NaN Eco 1577 0.0 4.0 0 2.0 Green Car 5.0 2.0 2.0 5.0 2.0 NaN NaN 4.0 4.0 NaN NaN 4.0 5.0
34067 98834068 Female Loyal Customer 52.0 NaN Business 547 0.0 0.0 1 5.0 Green Car 5.0 NaN 5.0 4.0 5.0 5.0 4.0 4.0 4.0 NaN NaN 4.0 4.0
34365 98834366 Female Loyal Customer 27.0 Personal Travel Eco 2121 0.0 0.0 1 4.0 Ordinary 4.0 4.0 4.0 5.0 4.0 NaN NaN 5.0 5.0 NaN NaN 5.0 4.0
34589 98834590 Male Loyal Customer 26.0 Personal Travel Eco 2001 0.0 0.0 0 4.0 Green Car 3.0 4.0 3.0 5.0 4.0 5.0 5.0 NaN NaN NaN 3.0 1.0 5.0
35112 98835113 Female Loyal Customer 37.0 Personal Travel Eco 1712 0.0 20.0 1 5.0 Green Car 5.0 3.0 5.0 2.0 5.0 NaN NaN 4.0 4.0 NaN NaN 4.0 4.0
35214 98835215 Male Loyal Customer 44.0 Business Travel Business 2476 6.0 0.0 1 4.0 Ordinary 4.0 4.0 4.0 4.0 5.0 NaN NaN 3.0 3.0 NaN NaN 3.0 5.0
35253 98835254 Male Loyal Customer 33.0 Personal Travel Eco 914 16.0 17.0 0 4.0 Green Car 1.0 0.0 3.0 5.0 0.0 4.0 5.0 1.0 2.0 NaN NaN 3.0 5.0
35928 98835929 Female Disloyal Customer 21.0 Business Travel Business 1543 0.0 0.0 1 4.0 Green Car 5.0 4.0 3.0 3.0 4.0 NaN NaN 1.0 5.0 NaN NaN 1.0 3.0
36254 98836255 Male Disloyal Customer 32.0 Business Travel Eco 1706 0.0 2.0 0 NaN Ordinary NaN 3.0 2.0 1.0 3.0 1.0 1.0 NaN NaN NaN 5.0 1.0 1.0
37552 98837553 Female Loyal Customer 51.0 Business Travel Business 1555 0.0 0.0 1 3.0 Green Car 3.0 NaN 3.0 2.0 5.0 NaN NaN 4.0 4.0 NaN NaN 4.0 4.0
38172 98838173 Female Loyal Customer 53.0 Business Travel Business 449 0.0 3.0 1 2.0 Green Car 5.0 NaN 2.0 3.0 1.0 NaN NaN 4.0 4.0 NaN NaN 4.0 5.0
38731 98838732 Male NaN 35.0 Business Travel Business 2403 0.0 0.0 1 5.0 Ordinary NaN 5.0 5.0 4.0 4.0 1.0 2.0 NaN NaN NaN 4.0 2.0 2.0
39389 98839390 Female Loyal Customer 39.0 NaN Business 375 10.0 12.0 1 3.0 Green Car NaN 3.0 3.0 2.0 4.0 5.0 4.0 NaN NaN NaN 4.0 4.0 5.0
39716 98839717 Female Loyal Customer 47.0 NaN Eco 401 9.0 46.0 1 5.0 Green Car 4.0 NaN 4.0 5.0 5.0 5.0 5.0 NaN NaN NaN 1.0 2.0 5.0
40634 98840635 Female Loyal Customer 27.0 Personal Travel Eco 2750 14.0 24.0 1 4.0 Ordinary NaN 3.0 3.0 4.0 4.0 4.0 4.0 NaN NaN NaN 5.0 2.0 4.0
40659 98840660 Male NaN 31.0 Business Travel Business 2139 19.0 8.0 1 2.0 Green Car 2.0 2.0 2.0 4.0 5.0 4.0 4.0 NaN NaN NaN 5.0 4.0 4.0
40893 98840894 Female Disloyal Customer 27.0 Business Travel Eco 1280 19.0 8.0 0 2.0 Ordinary 3.0 2.0 3.0 3.0 2.0 NaN NaN 3.0 2.0 NaN NaN 3.0 3.0
41681 98841682 Female Loyal Customer 43.0 Business Travel Eco 1066 0.0 0.0 1 5.0 Green Car 2.0 2.0 2.0 4.0 5.0 5.0 5.0 NaN NaN NaN 4.0 5.0 4.0
43715 98843716 Male Disloyal Customer 22.0 Business Travel Eco 2770 0.0 0.0 1 2.0 Ordinary 2.0 2.0 3.0 4.0 2.0 4.0 4.0 NaN NaN NaN 1.0 5.0 4.0
43836 98843837 Female Loyal Customer 13.0 Personal Travel Eco 1880 0.0 0.0 1 4.0 Ordinary 4.0 4.0 4.0 2.0 5.0 NaN NaN 4.0 4.0 NaN NaN 4.0 4.0
44190 98844191 Male Loyal Customer 31.0 NaN Business 4464 3.0 0.0 1 3.0 Ordinary 3.0 NaN 3.0 5.0 5.0 5.0 5.0 NaN NaN NaN 5.0 5.0 5.0
45055 98845056 Female Loyal Customer 30.0 NaN Business 4707 1.0 20.0 0 2.0 Ordinary 4.0 4.0 4.0 2.0 2.0 NaN NaN 3.0 1.0 NaN NaN 3.0 2.0
46420 98846421 Female NaN 11.0 Personal Travel Eco 1763 0.0 10.0 1 5.0 Green Car 5.0 5.0 5.0 4.0 4.0 5.0 5.0 NaN NaN NaN 5.0 5.0 3.0
46495 98846496 Male Loyal Customer 44.0 Personal Travel Eco 1387 6.0 83.0 0 2.0 Ordinary 0.0 NaN 3.0 5.0 0.0 3.0 5.0 NaN NaN NaN 4.0 4.0 5.0
46703 98846704 Female Disloyal Customer 24.0 NaN Eco 2008 130.0 127.0 0 3.0 Green Car 2.0 3.0 3.0 3.0 3.0 NaN NaN 3.0 4.0 NaN NaN 3.0 3.0
46912 98846913 Male Loyal Customer 47.0 NaN Eco 1598 0.0 0.0 0 4.0 Green Car NaN 1.0 1.0 4.0 4.0 4.0 4.0 NaN NaN NaN 3.0 3.0 4.0
47006 98847007 Female Disloyal Customer 25.0 Business Travel Eco 1098 0.0 0.0 0 1.0 Ordinary NaN NaN NaN NaN NaN NaN 3.0 5.0 3.0 NaN NaN 2.0 3.0
48013 98848014 Female Loyal Customer 38.0 NaN Business 2371 0.0 0.0 0 NaN Ordinary NaN 4.0 4.0 3.0 4.0 4.0 3.0 NaN NaN NaN 4.0 3.0 2.0
48208 98848209 Male Loyal Customer 55.0 Business Travel Business 2106 32.0 7.0 1 1.0 Green Car 1.0 NaN 1.0 4.0 5.0 5.0 4.0 NaN NaN NaN 3.0 4.0 5.0
49077 98849078 Female Disloyal Customer 44.0 Business Travel Eco 1573 105.0 85.0 0 3.0 Ordinary 4.0 4.0 4.0 3.0 4.0 4.0 3.0 NaN NaN NaN 2.0 3.0 3.0
50305 98850306 Male Loyal Customer 23.0 Personal Travel Eco 2534 0.0 0.0 0 2.0 Green Car NaN 2.0 3.0 4.0 2.0 NaN NaN 3.0 4.0 NaN NaN 3.0 4.0
50462 98850463 Male Loyal Customer 52.0 Business Travel Business 391 14.0 15.0 1 3.0 Ordinary 3.0 3.0 3.0 4.0 4.0 5.0 4.0 NaN NaN NaN 4.0 4.0 5.0
50701 98850702 Male Loyal Customer 63.0 Personal Travel Eco 2443 1.0 0.0 0 3.0 Green Car 5.0 4.0 5.0 3.0 4.0 NaN NaN 5.0 4.0 NaN NaN 4.0 3.0
51471 98851472 Male Loyal Customer 73.0 Business Travel Eco 1898 14.0 23.0 0 3.0 Ordinary 3.0 3.0 3.0 3.0 3.0 3.0 3.0 1.0 4.0 NaN NaN 3.0 3.0
51963 98851964 Male Loyal Customer 55.0 Business Travel Eco 2255 0.0 0.0 1 4.0 Ordinary NaN 5.0 5.0 4.0 4.0 NaN NaN 5.0 1.0 NaN NaN 4.0 4.0
52133 98852134 Female Loyal Customer 34.0 Personal Travel Eco 2022 16.0 11.0 1 2.0 Green Car NaN 2.0 2.0 4.0 5.0 4.0 5.0 NaN NaN NaN 3.0 5.0 3.0
52205 98852206 Female Loyal Customer 19.0 Business Travel Business 3622 2.0 19.0 1 5.0 Green Car 5.0 5.0 5.0 4.0 4.0 4.0 4.0 3.0 1.0 NaN NaN 1.0 4.0
52879 98852880 Female Disloyal Customer 25.0 Business Travel Eco 2720 0.0 0.0 0 2.0 Green Car 2.0 2.0 3.0 3.0 2.0 3.0 3.0 4.0 2.0 NaN NaN 4.0 3.0
53370 98853371 Male Loyal Customer 40.0 Personal Travel Eco 2251 0.0 8.0 0 1.0 Green Car 1.0 1.0 1.0 1.0 1.0 NaN NaN 3.0 3.0 NaN NaN 1.0 1.0
53604 98853605 Male Loyal Customer 34.0 Business Travel Business 1856 5.0 0.0 1 1.0 Ordinary 1.0 NaN 1.0 4.0 4.0 5.0 5.0 5.0 5.0 NaN NaN 5.0 4.0
55454 98855455 Male Disloyal Customer 37.0 Business Travel Business 2690 5.0 0.0 0 2.0 Green Car 2.0 2.0 5.0 3.0 2.0 1.0 3.0 NaN 1.0 NaN NaN 1.0 3.0
56493 98856494 Male NaN 52.0 Business Travel Business 2289 0.0 8.0 1 0.0 Ordinary NaN 0.0 2.0 4.0 4.0 5.0 1.0 NaN NaN NaN 4.0 1.0 5.0
57218 98857219 Female Loyal Customer 18.0 Personal Travel Eco 1853 36.0 27.0 0 4.0 Green Car 3.0 4.0 4.0 2.0 4.0 3.0 2.0 1.0 1.0 NaN NaN 1.0 2.0
58003 98858004 Female Loyal Customer 25.0 Business Travel Eco 3113 18.0 11.0 1 4.0 Ordinary 5.0 5.0 5.0 4.0 4.0 5.0 4.0 3.0 2.0 NaN NaN 2.0 4.0
58502 98858503 Male Loyal Customer 16.0 Business Travel Eco 1966 0.0 0.0 0 1.0 Green Car 3.0 3.0 3.0 1.0 1.0 1.0 1.0 NaN NaN NaN 4.0 4.0 1.0
58878 98858879 Female Loyal Customer 52.0 Personal Travel Eco 1789 2.0 0.0 1 4.0 Ordinary 5.0 NaN 4.0 5.0 5.0 NaN NaN 2.0 5.0 NaN NaN 2.0 3.0
58982 98858983 Male Disloyal Customer 7.0 Business Travel Eco 2016 22.0 11.0 0 4.0 Green Car 2.0 4.0 4.0 5.0 4.0 NaN NaN 4.0 2.0 NaN NaN 4.0 5.0
59121 98859122 Male Loyal Customer 34.0 Business Travel Business 641 0.0 0.0 1 3.0 Green Car NaN 3.0 3.0 2.0 4.0 5.0 4.0 4.0 4.0 NaN NaN 4.0 4.0
60804 98860805 Male Loyal Customer 63.0 NaN Eco 2081 42.0 34.0 0 2.0 Green Car 4.0 2.0 4.0 1.0 2.0 1.0 1.0 NaN NaN NaN 3.0 5.0 1.0
61021 98861022 Male Loyal Customer 39.0 Business Travel Business 1932 14.0 33.0 1 3.0 Green Car 3.0 NaN 3.0 5.0 5.0 NaN NaN 4.0 4.0 NaN NaN 4.0 5.0
62337 98862338 Male Loyal Customer 50.0 NaN Business 2681 9.0 0.0 0 2.0 Ordinary 5.0 5.0 5.0 2.0 4.0 3.0 2.0 NaN NaN NaN 3.0 2.0 3.0
64619 98864620 Female Loyal Customer 52.0 Personal Travel Eco 125 127.0 123.0 1 5.0 Green Car 3.0 5.0 3.0 2.0 5.0 5.0 2.0 NaN NaN NaN 1.0 3.0 2.0
64797 98864798 Female Loyal Customer 38.0 Business Travel Eco 1721 44.0 65.0 1 5.0 Ordinary 5.0 5.0 5.0 5.0 5.0 5.0 5.0 2.0 2.0 NaN NaN 1.0 5.0
65793 98865794 Male Loyal Customer 48.0 Business Travel Eco 2701 44.0 75.0 0 1.0 Green Car 4.0 4.0 4.0 1.0 1.0 1.0 1.0 NaN NaN NaN 1.0 4.0 1.0
65831 98865832 Female Loyal Customer 54.0 Business Travel Business 3528 10.0 8.0 1 4.0 Ordinary 4.0 4.0 4.0 4.0 5.0 5.0 5.0 NaN NaN NaN 4.0 5.0 5.0
66977 98866978 Female Loyal Customer 53.0 Business Travel Business 922 6.0 5.0 1 3.0 Green Car 3.0 3.0 3.0 1.0 4.0 NaN NaN 4.0 4.0 NaN NaN 4.0 2.0
68315 98868316 Female Disloyal Customer 22.0 Business Travel Eco 2237 9.0 0.0 1 2.0 Ordinary 3.0 3.0 3.0 4.0 3.0 NaN NaN 3.0 3.0 NaN NaN 3.0 4.0
68326 98868327 Male Loyal Customer 60.0 Business Travel Eco 2304 42.0 39.0 0 1.0 Ordinary 3.0 3.0 3.0 1.0 1.0 NaN NaN 1.0 2.0 NaN NaN 4.0 1.0
68796 98868797 Female Loyal Customer 54.0 Personal Travel Eco 245 0.0 6.0 1 5.0 Green Car 5.0 5.0 5.0 5.0 5.0 NaN NaN 4.0 4.0 NaN NaN 4.0 5.0
69242 98869243 Male Loyal Customer 29.0 Business Travel Business 3522 12.0 7.0 1 5.0 Green Car 5.0 5.0 5.0 4.0 4.0 4.0 4.0 5.0 4.0 NaN NaN 5.0 4.0
70149 98870150 Female NaN 41.0 Business Travel Business 497 0.0 0.0 0 2.0 Green Car 4.0 4.0 4.0 4.0 4.0 3.0 2.0 2.0 3.0 NaN NaN 2.0 2.0
70225 98870226 Female Loyal Customer 14.0 Personal Travel Eco 1836 3.0 9.0 0 4.0 Green Car 3.0 4.0 3.0 5.0 4.0 5.0 5.0 1.0 2.0 NaN NaN 5.0 5.0
70627 98870628 Female Loyal Customer 49.0 Business Travel Business 2522 0.0 0.0 1 5.0 Ordinary 5.0 NaN 5.0 5.0 4.0 NaN NaN 5.0 5.0 NaN NaN 5.0 3.0
71691 98871692 Female Loyal Customer 62.0 Business Travel Business 329 10.0 5.0 0 1.0 Ordinary 1.0 1.0 1.0 2.0 2.0 3.0 1.0 1.0 1.0 NaN NaN 1.0 2.0
72379 98872380 Female Loyal Customer 48.0 Personal Travel Eco 584 0.0 0.0 1 5.0 Ordinary 5.0 5.0 5.0 5.0 4.0 4.0 4.0 NaN NaN NaN 4.0 4.0 3.0
72395 98872396 Female Disloyal Customer 32.0 Business Travel Eco 2433 38.0 50.0 0 2.0 Ordinary 2.0 2.0 3.0 4.0 2.0 4.0 4.0 4.0 1.0 NaN NaN 4.0 4.0
72549 98872550 Female Loyal Customer 39.0 Business Travel Business 629 0.0 0.0 1 5.0 Green Car 5.0 5.0 5.0 5.0 5.0 NaN NaN 4.0 4.0 NaN NaN 4.0 5.0
73825 98873826 Male Loyal Customer 8.0 Business Travel Business 2988 2.0 0.0 1 3.0 Green Car 3.0 3.0 3.0 2.0 2.0 NaN NaN 5.0 5.0 NaN NaN 4.0 2.0
74459 98874460 Male Loyal Customer 45.0 Business Travel Business 483 4.0 4.0 1 3.0 Green Car 3.0 3.0 3.0 3.0 4.0 5.0 5.0 NaN NaN NaN 4.0 5.0 5.0
77629 98877630 Male Loyal Customer 54.0 Business Travel Business 1968 0.0 0.0 1 1.0 Green Car NaN 3.0 1.0 5.0 4.0 4.0 5.0 NaN NaN NaN 4.0 5.0 3.0
79166 98879167 Male Loyal Customer 26.0 NaN Eco 3954 0.0 0.0 0 3.0 Green Car 2.0 3.0 1.0 3.0 5.0 5.0 1.0 NaN NaN NaN 5.0 2.0 5.0
80131 98880132 Male Loyal Customer 52.0 Business Travel Business 3161 0.0 2.0 1 5.0 Green Car 5.0 5.0 5.0 3.0 4.0 NaN NaN 4.0 4.0 NaN NaN 4.0 5.0
80665 98880666 Male Loyal Customer 25.0 NaN Eco 1560 1.0 0.0 0 2.0 Ordinary NaN 2.0 2.0 4.0 2.0 4.0 4.0 NaN NaN NaN 2.0 1.0 4.0
80709 98880710 Male Disloyal Customer 36.0 NaN Eco 1964 104.0 100.0 0 4.0 Ordinary 2.0 4.0 4.0 5.0 4.0 4.0 5.0 NaN NaN NaN 4.0 3.0 5.0
81304 98881305 Male Loyal Customer 50.0 Business Travel Eco 1707 0.0 0.0 1 4.0 Green Car 2.0 2.0 2.0 4.0 4.0 NaN NaN 2.0 5.0 NaN NaN 1.0 4.0
81334 98881335 Male Disloyal Customer 23.0 Business Travel Business 1726 0.0 0.0 1 5.0 Ordinary 0.0 5.0 3.0 4.0 5.0 2.0 4.0 5.0 2.0 NaN NaN 5.0 4.0
81399 98881400 Female Loyal Customer 9.0 Business Travel Eco 2278 60.0 53.0 0 NaN Ordinary NaN 4.0 1.0 3.0 3.0 4.0 3.0 NaN NaN NaN 1.0 4.0 3.0
81504 98881505 Female Loyal Customer 47.0 Personal Travel Eco 1076 34.0 65.0 1 NaN Ordinary NaN 0.0 4.0 5.0 5.0 4.0 2.0 NaN NaN NaN 5.0 2.0 3.0
81664 98881665 Male Loyal Customer 60.0 Business Travel Business 2212 0.0 0.0 1 1.0 Green Car 1.0 5.0 1.0 5.0 4.0 NaN NaN 4.0 4.0 NaN NaN 4.0 5.0
81800 98881801 Male Loyal Customer 58.0 Business Travel Business 2009 0.0 0.0 1 1.0 Ordinary 1.0 1.0 1.0 3.0 5.0 NaN NaN 4.0 4.0 NaN NaN 4.0 5.0
81876 98881877 Female Loyal Customer 33.0 Business Travel Eco 1713 19.0 23.0 1 NaN Green Car NaN 5.0 5.0 5.0 5.0 5.0 5.0 NaN NaN NaN 4.0 1.0 5.0
81996 98881997 Male Loyal Customer 34.0 Business Travel Business 3210 80.0 73.0 1 1.0 Ordinary 1.0 4.0 1.0 1.0 2.0 NaN NaN 5.0 5.0 NaN NaN 5.0 2.0
82501 98882502 Male NaN 49.0 Personal Travel Eco 1374 17.0 11.0 0 2.0 Green Car 5.0 3.0 2.0 5.0 3.0 5.0 5.0 NaN NaN NaN 4.0 3.0 5.0
82664 98882665 Female Disloyal Customer 58.0 Business Travel Business 2089 0.0 0.0 0 3.0 Green Car NaN 3.0 1.0 2.0 3.0 5.0 2.0 4.0 2.0 NaN NaN 4.0 2.0
83052 98883053 Female Loyal Customer 42.0 Personal Travel Eco 890 0.0 0.0 1 3.0 Green Car 3.0 3.0 3.0 4.0 4.0 NaN NaN 5.0 3.0 NaN NaN 4.0 4.0
83881 98883882 Male Loyal Customer 11.0 Personal Travel Eco 2883 0.0 0.0 0 2.0 Ordinary 4.0 2.0 1.0 1.0 2.0 NaN NaN 5.0 3.0 NaN NaN 4.0 1.0
83934 98883935 Female Disloyal Customer 36.0 Business Travel Business 1917 8.0 0.0 0 4.0 Green Car 4.0 4.0 5.0 2.0 4.0 2.0 2.0 NaN NaN NaN 4.0 4.0 2.0
84335 98884336 Male Disloyal Customer 36.0 Business Travel Eco 1595 8.0 5.0 0 2.0 Green Car 2.0 1.0 NaN NaN 1.0 5.0 5.0 4.0 5.0 NaN NaN 4.0 5.0
84551 98884552 Male Loyal Customer 60.0 Business Travel Eco 2085 3.0 0.0 0 NaN Ordinary NaN 1.0 1.0 2.0 2.0 2.0 2.0 NaN NaN NaN 3.0 3.0 2.0
84798 98884799 Female Loyal Customer 57.0 Business Travel Business 3931 58.0 64.0 1 5.0 Ordinary 5.0 NaN 5.0 5.0 4.0 NaN NaN 4.0 4.0 NaN NaN 4.0 5.0
84876 98884877 Female Disloyal Customer 20.0 NaN Eco 1994 1.0 0.0 0 3.0 Ordinary NaN 3.0 4.0 3.0 3.0 2.0 3.0 NaN NaN NaN 2.0 3.0 3.0
85713 98885714 Female Disloyal Customer 26.0 Business Travel Eco 2061 4.0 34.0 0 NaN Green Car NaN 2.0 4.0 2.0 2.0 2.0 2.0 NaN NaN NaN 3.0 4.0 2.0
86180 98886181 Male Loyal Customer 44.0 Business Travel Eco 1574 0.0 0.0 1 5.0 Green Car NaN 2.0 2.0 5.0 5.0 NaN NaN 1.0 5.0 NaN NaN 1.0 5.0
86991 98886992 Female Loyal Customer 47.0 NaN Business 3234 0.0 39.0 0 3.0 Ordinary 2.0 2.0 2.0 1.0 3.0 4.0 3.0 NaN NaN NaN 1.0 3.0 2.0
87129 98887130 Male Loyal Customer 34.0 Business Travel Business 3925 5.0 33.0 1 5.0 Ordinary 5.0 5.0 5.0 5.0 3.0 NaN NaN NaN 5.0 NaN NaN 5.0 1.0
87693 98887694 Female Loyal Customer 40.0 Business Travel Eco 2361 0.0 0.0 0 NaN Green Car NaN 5.0 5.0 2.0 2.0 2.0 2.0 NaN NaN NaN 3.0 3.0 2.0
88765 98888766 Female Disloyal Customer 49.0 Business Travel Eco 1788 22.0 18.0 0 NaN Green Car NaN 3.0 4.0 2.0 3.0 1.0 2.0 NaN NaN NaN 2.0 3.0 2.0
88852 98888853 Female NaN 66.0 Personal Travel Eco 696 10.0 0.0 1 5.0 Green Car 5.0 5.0 5.0 2.0 5.0 4.0 3.0 NaN NaN NaN 3.0 3.0 4.0
94141 98894142 Male Loyal Customer 29.0 Business Travel Business 2700 0.0 0.0 1 3.0 Green Car 3.0 3.0 3.0 5.0 5.0 5.0 5.0 5.0 4.0 NaN NaN 5.0 5.0

CheckIn Service¶

In [ ]:
labeled_countplot(train,'CheckIn_Service', perc = True, order = True)
Number of null values:  77
In [ ]:
train.loc[train['CheckIn_Service'].isnull()]
Out[ ]:
ID Gender Customer_Type Age Type_Travel Travel_Class Travel_Distance Departure_Delay_in_Mins Arrival_Delay_in_Mins Overall_Experience Seat_Comfort Seat_Class Arrival_Time_Convenient Catering Platform_Location Onboard_Wifi_Service Onboard_Entertainment Online_Support Ease_of_Online_Booking Onboard_Service Legroom Baggage_Handling CheckIn_Service Cleanliness Online_Boarding
13 98800014 Female Loyal Customer 47.0 Personal Travel Eco 1100 20.0 34.0 0 4.0 Ordinary 4.0 4.0 3.0 4.0 5.0 NaN NaN 3.0 4.0 NaN NaN 3.0 4.0
2821 98802822 Male Loyal Customer 30.0 NaN Eco 1018 21.0 11.0 1 5.0 Green Car 1.0 3.0 3.0 5.0 5.0 NaN NaN 2.0 3.0 NaN NaN 5.0 5.0
3249 98803250 Male Loyal Customer 43.0 Business Travel Business 807 27.0 13.0 0 3.0 Ordinary NaN 1.0 1.0 3.0 3.0 NaN NaN 3.0 3.0 NaN NaN 3.0 4.0
5912 98805913 Male NaN 60.0 Business Travel Business 2931 0.0 7.0 0 4.0 Ordinary NaN 3.0 3.0 2.0 3.0 NaN NaN 4.0 4.0 NaN NaN 4.0 3.0
10656 98810657 Male Loyal Customer 10.0 Personal Travel Eco 2163 0.0 0.0 0 1.0 Ordinary 5.0 NaN 2.0 3.0 1.0 NaN NaN 5.0 3.0 NaN NaN 5.0 3.0
11134 98811135 Male Loyal Customer 68.0 Personal Travel Eco 1568 87.0 94.0 0 2.0 Green Car 2.0 NaN 3.0 1.0 0.0 1.0 1.0 1.0 3.0 NaN NaN 4.0 1.0
11203 98811204 Male Loyal Customer 37.0 Business Travel Business 2637 0.0 0.0 1 1.0 Ordinary 1.0 1.0 1.0 5.0 3.0 5.0 5.0 5.0 4.0 NaN NaN 5.0 5.0
12755 98812756 Female Loyal Customer 47.0 NaN Eco 3113 1592.0 1584.0 0 2.0 Ordinary 2.0 2.0 3.0 2.0 4.0 NaN NaN 4.0 4.0 NaN NaN 3.0 2.0
15119 98815120 Female Loyal Customer 53.0 NaN Business 1646 0.0 0.0 1 5.0 Green Car 5.0 NaN 5.0 3.0 3.0 NaN NaN 4.0 4.0 NaN NaN 4.0 5.0
16347 98816348 Male Disloyal Customer 43.0 Business Travel Business 1491 53.0 53.0 0 3.0 Green Car 2.0 2.0 3.0 4.0 3.0 NaN NaN 1.0 5.0 NaN NaN 4.0 4.0
17364 98817365 Female Loyal Customer 63.0 Personal Travel Business 442 0.0 0.0 1 0.0 Green Car 0.0 NaN 3.0 3.0 4.0 NaN NaN 5.0 5.0 NaN NaN 5.0 4.0
18741 98818742 Male Loyal Customer 25.0 Business Travel Business 4957 5.0 0.0 1 1.0 Green Car 1.0 3.0 1.0 4.0 4.0 4.0 4.0 5.0 2.0 NaN NaN 5.0 4.0
22547 98822548 Female Disloyal Customer 50.0 NaN Business 1802 0.0 0.0 0 4.0 Ordinary 4.0 NaN 2.0 3.0 4.0 3.0 3.0 3.0 4.0 NaN NaN 4.0 3.0
23692 98823693 Female Loyal Customer 31.0 Business Travel Business 1945 0.0 0.0 0 1.0 Ordinary NaN 4.0 4.0 1.0 1.0 1.0 1.0 1.0 1.0 NaN NaN 3.0 1.0
23718 98823719 Male NaN 65.0 Personal Travel Eco 2320 0.0 0.0 0 3.0 Green Car 3.0 3.0 4.0 4.0 3.0 1.0 4.0 3.0 1.0 NaN NaN 3.0 4.0
24017 98824018 Male Loyal Customer 28.0 Business Travel Business 4115 29.0 0.0 1 5.0 Ordinary NaN 5.0 5.0 2.0 2.0 NaN NaN 5.0 3.0 NaN NaN 3.0 2.0
24928 98824929 Male Loyal Customer 55.0 NaN Business 2404 0.0 0.0 0 4.0 Green Car 1.0 1.0 1.0 2.0 4.0 NaN NaN NaN 4.0 NaN NaN 4.0 4.0
28146 98828147 Female Loyal Customer 37.0 Personal Travel Eco 2395 0.0 0.0 1 2.0 Ordinary 2.0 2.0 2.0 2.0 4.0 NaN NaN 4.0 4.0 NaN NaN 4.0 5.0
30709 98830710 Female Loyal Customer 65.0 Personal Travel Eco 617 3.0 24.0 0 3.0 Ordinary 5.0 3.0 5.0 3.0 4.0 5.0 1.0 NaN 3.0 NaN NaN 1.0 4.0
32086 98832087 Male Loyal Customer 39.0 Business Travel Business 302 87.0 90.0 1 4.0 Ordinary 4.0 4.0 NaN NaN 5.0 4.0 4.0 4.0 4.0 NaN NaN 4.0 4.0
32913 98832914 Female Loyal Customer 53.0 Business Travel Business 1566 16.0 0.0 1 1.0 Ordinary 1.0 1.0 1.0 4.0 5.0 NaN NaN 4.0 4.0 NaN NaN 4.0 5.0
33809 98833810 Male Loyal Customer 47.0 NaN Eco 1577 0.0 4.0 0 2.0 Green Car 5.0 2.0 2.0 5.0 2.0 NaN NaN 4.0 4.0 NaN NaN 4.0 5.0
34067 98834068 Female Loyal Customer 52.0 NaN Business 547 0.0 0.0 1 5.0 Green Car 5.0 NaN 5.0 4.0 5.0 5.0 4.0 4.0 4.0 NaN NaN 4.0 4.0
34365 98834366 Female Loyal Customer 27.0 Personal Travel Eco 2121 0.0 0.0 1 4.0 Ordinary 4.0 4.0 4.0 5.0 4.0 NaN NaN 5.0 5.0 NaN NaN 5.0 4.0
35112 98835113 Female Loyal Customer 37.0 Personal Travel Eco 1712 0.0 20.0 1 5.0 Green Car 5.0 3.0 5.0 2.0 5.0 NaN NaN 4.0 4.0 NaN NaN 4.0 4.0
35214 98835215 Male Loyal Customer 44.0 Business Travel Business 2476 6.0 0.0 1 4.0 Ordinary 4.0 4.0 4.0 4.0 5.0 NaN NaN 3.0 3.0 NaN NaN 3.0 5.0
35253 98835254 Male Loyal Customer 33.0 Personal Travel Eco 914 16.0 17.0 0 4.0 Green Car 1.0 0.0 3.0 5.0 0.0 4.0 5.0 1.0 2.0 NaN NaN 3.0 5.0
35928 98835929 Female Disloyal Customer 21.0 Business Travel Business 1543 0.0 0.0 1 4.0 Green Car 5.0 4.0 3.0 3.0 4.0 NaN NaN 1.0 5.0 NaN NaN 1.0 3.0
37552 98837553 Female Loyal Customer 51.0 Business Travel Business 1555 0.0 0.0 1 3.0 Green Car 3.0 NaN 3.0 2.0 5.0 NaN NaN 4.0 4.0 NaN NaN 4.0 4.0
38172 98838173 Female Loyal Customer 53.0 Business Travel Business 449 0.0 3.0 1 2.0 Green Car 5.0 NaN 2.0 3.0 1.0 NaN NaN 4.0 4.0 NaN NaN 4.0 5.0
40893 98840894 Female Disloyal Customer 27.0 Business Travel Eco 1280 19.0 8.0 0 2.0 Ordinary 3.0 2.0 3.0 3.0 2.0 NaN NaN 3.0 2.0 NaN NaN 3.0 3.0
43836 98843837 Female Loyal Customer 13.0 Personal Travel Eco 1880 0.0 0.0 1 4.0 Ordinary 4.0 4.0 4.0 2.0 5.0 NaN NaN 4.0 4.0 NaN NaN 4.0 4.0
45055 98845056 Female Loyal Customer 30.0 NaN Business 4707 1.0 20.0 0 2.0 Ordinary 4.0 4.0 4.0 2.0 2.0 NaN NaN 3.0 1.0 NaN NaN 3.0 2.0
46703 98846704 Female Disloyal Customer 24.0 NaN Eco 2008 130.0 127.0 0 3.0 Green Car 2.0 3.0 3.0 3.0 3.0 NaN NaN 3.0 4.0 NaN NaN 3.0 3.0
47006 98847007 Female Disloyal Customer 25.0 Business Travel Eco 1098 0.0 0.0 0 1.0 Ordinary NaN NaN NaN NaN NaN NaN 3.0 5.0 3.0 NaN NaN 2.0 3.0
50305 98850306 Male Loyal Customer 23.0 Personal Travel Eco 2534 0.0 0.0 0 2.0 Green Car NaN 2.0 3.0 4.0 2.0 NaN NaN 3.0 4.0 NaN NaN 3.0 4.0
50701 98850702 Male Loyal Customer 63.0 Personal Travel Eco 2443 1.0 0.0 0 3.0 Green Car 5.0 4.0 5.0 3.0 4.0 NaN NaN 5.0 4.0 NaN NaN 4.0 3.0
51471 98851472 Male Loyal Customer 73.0 Business Travel Eco 1898 14.0 23.0 0 3.0 Ordinary 3.0 3.0 3.0 3.0 3.0 3.0 3.0 1.0 4.0 NaN NaN 3.0 3.0
51963 98851964 Male Loyal Customer 55.0 Business Travel Eco 2255 0.0 0.0 1 4.0 Ordinary NaN 5.0 5.0 4.0 4.0 NaN NaN 5.0 1.0 NaN NaN 4.0 4.0
52205 98852206 Female Loyal Customer 19.0 Business Travel Business 3622 2.0 19.0 1 5.0 Green Car 5.0 5.0 5.0 4.0 4.0 4.0 4.0 3.0 1.0 NaN NaN 1.0 4.0
52879 98852880 Female Disloyal Customer 25.0 Business Travel Eco 2720 0.0 0.0 0 2.0 Green Car 2.0 2.0 3.0 3.0 2.0 3.0 3.0 4.0 2.0 NaN NaN 4.0 3.0
53370 98853371 Male Loyal Customer 40.0 Personal Travel Eco 2251 0.0 8.0 0 1.0 Green Car 1.0 1.0 1.0 1.0 1.0 NaN NaN 3.0 3.0 NaN NaN 1.0 1.0
53604 98853605 Male Loyal Customer 34.0 Business Travel Business 1856 5.0 0.0 1 1.0 Ordinary 1.0 NaN 1.0 4.0 4.0 5.0 5.0 5.0 5.0 NaN NaN 5.0 4.0
55454 98855455 Male Disloyal Customer 37.0 Business Travel Business 2690 5.0 0.0 0 2.0 Green Car 2.0 2.0 5.0 3.0 2.0 1.0 3.0 NaN 1.0 NaN NaN 1.0 3.0
57218 98857219 Female Loyal Customer 18.0 Personal Travel Eco 1853 36.0 27.0 0 4.0 Green Car 3.0 4.0 4.0 2.0 4.0 3.0 2.0 1.0 1.0 NaN NaN 1.0 2.0
58003 98858004 Female Loyal Customer 25.0 Business Travel Eco 3113 18.0 11.0 1 4.0 Ordinary 5.0 5.0 5.0 4.0 4.0 5.0 4.0 3.0 2.0 NaN NaN 2.0 4.0
58878 98858879 Female Loyal Customer 52.0 Personal Travel Eco 1789 2.0 0.0 1 4.0 Ordinary 5.0 NaN 4.0 5.0 5.0 NaN NaN 2.0 5.0 NaN NaN 2.0 3.0
58982 98858983 Male Disloyal Customer 7.0 Business Travel Eco 2016 22.0 11.0 0 4.0 Green Car 2.0 4.0 4.0 5.0 4.0 NaN NaN 4.0 2.0 NaN NaN 4.0 5.0
59121 98859122 Male Loyal Customer 34.0 Business Travel Business 641 0.0 0.0 1 3.0 Green Car NaN 3.0 3.0 2.0 4.0 5.0 4.0 4.0 4.0 NaN NaN 4.0 4.0
61021 98861022 Male Loyal Customer 39.0 Business Travel Business 1932 14.0 33.0 1 3.0 Green Car 3.0 NaN 3.0 5.0 5.0 NaN NaN 4.0 4.0 NaN NaN 4.0 5.0
64797 98864798 Female Loyal Customer 38.0 Business Travel Eco 1721 44.0 65.0 1 5.0 Ordinary 5.0 5.0 5.0 5.0 5.0 5.0 5.0 2.0 2.0 NaN NaN 1.0 5.0
66977 98866978 Female Loyal Customer 53.0 Business Travel Business 922 6.0 5.0 1 3.0 Green Car 3.0 3.0 3.0 1.0 4.0 NaN NaN 4.0 4.0 NaN NaN 4.0 2.0
68315 98868316 Female Disloyal Customer 22.0 Business Travel Eco 2237 9.0 0.0 1 2.0 Ordinary 3.0 3.0 3.0 4.0 3.0 NaN NaN 3.0 3.0 NaN NaN 3.0 4.0
68326 98868327 Male Loyal Customer 60.0 Business Travel Eco 2304 42.0 39.0 0 1.0 Ordinary 3.0 3.0 3.0 1.0 1.0 NaN NaN 1.0 2.0 NaN NaN 4.0 1.0
68796 98868797 Female Loyal Customer 54.0 Personal Travel Eco 245 0.0 6.0 1 5.0 Green Car 5.0 5.0 5.0 5.0 5.0 NaN NaN 4.0 4.0 NaN NaN 4.0 5.0
69242 98869243 Male Loyal Customer 29.0 Business Travel Business 3522 12.0 7.0 1 5.0 Green Car 5.0 5.0 5.0 4.0 4.0 4.0 4.0 5.0 4.0 NaN NaN 5.0 4.0
70149 98870150 Female NaN 41.0 Business Travel Business 497 0.0 0.0 0 2.0 Green Car 4.0 4.0 4.0 4.0 4.0 3.0 2.0 2.0 3.0 NaN NaN 2.0 2.0
70225 98870226 Female Loyal Customer 14.0 Personal Travel Eco 1836 3.0 9.0 0 4.0 Green Car 3.0 4.0 3.0 5.0 4.0 5.0 5.0 1.0 2.0 NaN NaN 5.0 5.0
70627 98870628 Female Loyal Customer 49.0 Business Travel Business 2522 0.0 0.0 1 5.0 Ordinary 5.0 NaN 5.0 5.0 4.0 NaN NaN 5.0 5.0 NaN NaN 5.0 3.0
71691 98871692 Female Loyal Customer 62.0 Business Travel Business 329 10.0 5.0 0 1.0 Ordinary 1.0 1.0 1.0 2.0 2.0 3.0 1.0 1.0 1.0 NaN NaN 1.0 2.0
72395 98872396 Female Disloyal Customer 32.0 Business Travel Eco 2433 38.0 50.0 0 2.0 Ordinary 2.0 2.0 3.0 4.0 2.0 4.0 4.0 4.0 1.0 NaN NaN 4.0 4.0
72549 98872550 Female Loyal Customer 39.0 Business Travel Business 629 0.0 0.0 1 5.0 Green Car 5.0 5.0 5.0 5.0 5.0 NaN NaN 4.0 4.0 NaN NaN 4.0 5.0
73825 98873826 Male Loyal Customer 8.0 Business Travel Business 2988 2.0 0.0 1 3.0 Green Car 3.0 3.0 3.0 2.0 2.0 NaN NaN 5.0 5.0 NaN NaN 4.0 2.0
80131 98880132 Male Loyal Customer 52.0 Business Travel Business 3161 0.0 2.0 1 5.0 Green Car 5.0 5.0 5.0 3.0 4.0 NaN NaN 4.0 4.0 NaN NaN 4.0 5.0
81304 98881305 Male Loyal Customer 50.0 Business Travel Eco 1707 0.0 0.0 1 4.0 Green Car 2.0 2.0 2.0 4.0 4.0 NaN NaN 2.0 5.0 NaN NaN 1.0 4.0
81334 98881335 Male Disloyal Customer 23.0 Business Travel Business 1726 0.0 0.0 1 5.0 Ordinary 0.0 5.0 3.0 4.0 5.0 2.0 4.0 5.0 2.0 NaN NaN 5.0 4.0
81664 98881665 Male Loyal Customer 60.0 Business Travel Business 2212 0.0 0.0 1 1.0 Green Car 1.0 5.0 1.0 5.0 4.0 NaN NaN 4.0 4.0 NaN NaN 4.0 5.0
81800 98881801 Male Loyal Customer 58.0 Business Travel Business 2009 0.0 0.0 1 1.0 Ordinary 1.0 1.0 1.0 3.0 5.0 NaN NaN 4.0 4.0 NaN NaN 4.0 5.0
81996 98881997 Male Loyal Customer 34.0 Business Travel Business 3210 80.0 73.0 1 1.0 Ordinary 1.0 4.0 1.0 1.0 2.0 NaN NaN 5.0 5.0 NaN NaN 5.0 2.0
82664 98882665 Female Disloyal Customer 58.0 Business Travel Business 2089 0.0 0.0 0 3.0 Green Car NaN 3.0 1.0 2.0 3.0 5.0 2.0 4.0 2.0 NaN NaN 4.0 2.0
83052 98883053 Female Loyal Customer 42.0 Personal Travel Eco 890 0.0 0.0 1 3.0 Green Car 3.0 3.0 3.0 4.0 4.0 NaN NaN 5.0 3.0 NaN NaN 4.0 4.0
83881 98883882 Male Loyal Customer 11.0 Personal Travel Eco 2883 0.0 0.0 0 2.0 Ordinary 4.0 2.0 1.0 1.0 2.0 NaN NaN 5.0 3.0 NaN NaN 4.0 1.0
84335 98884336 Male Disloyal Customer 36.0 Business Travel Eco 1595 8.0 5.0 0 2.0 Green Car 2.0 1.0 NaN NaN 1.0 5.0 5.0 4.0 5.0 NaN NaN 4.0 5.0
84798 98884799 Female Loyal Customer 57.0 Business Travel Business 3931 58.0 64.0 1 5.0 Ordinary 5.0 NaN 5.0 5.0 4.0 NaN NaN 4.0 4.0 NaN NaN 4.0 5.0
86180 98886181 Male Loyal Customer 44.0 Business Travel Eco 1574 0.0 0.0 1 5.0 Green Car NaN 2.0 2.0 5.0 5.0 NaN NaN 1.0 5.0 NaN NaN 1.0 5.0
87129 98887130 Male Loyal Customer 34.0 Business Travel Business 3925 5.0 33.0 1 5.0 Ordinary 5.0 5.0 5.0 5.0 3.0 NaN NaN NaN 5.0 NaN NaN 5.0 1.0
94141 98894142 Male Loyal Customer 29.0 Business Travel Business 2700 0.0 0.0 1 3.0 Green Car 3.0 3.0 3.0 5.0 5.0 5.0 5.0 5.0 4.0 NaN NaN 5.0 5.0

Cleanliness¶

In [ ]:
labeled_countplot(train,'Cleanliness', perc = True, order = True)
Number of null values:  6
In [ ]:
train.loc[train['Cleanliness'].isnull()]
Out[ ]:
ID Gender Customer_Type Age Type_Travel Travel_Class Travel_Distance Departure_Delay_in_Mins Arrival_Delay_in_Mins Overall_Experience Seat_Comfort Seat_Class Arrival_Time_Convenient Catering Platform_Location Onboard_Wifi_Service Onboard_Entertainment Online_Support Ease_of_Online_Booking Onboard_Service Legroom Baggage_Handling CheckIn_Service Cleanliness Online_Boarding
3210 98803211 Male Loyal Customer 24.0 Personal Travel Eco 1473 11.0 1.0 1 0.0 Green Car 1.0 0.0 3.0 5.0 0.0 5.0 5.0 1.0 1.0 4.0 4.0 NaN NaN
29045 98829046 Male Loyal Customer 37.0 Personal Travel Eco 1887 0.0 0.0 1 0.0 Green Car 1.0 0.0 3.0 1.0 0.0 1.0 1.0 2.0 5.0 3.0 1.0 NaN NaN
48087 98848088 Male Loyal Customer 42.0 Personal Travel Eco 2311 0.0 0.0 1 0.0 Green Car NaN 0.0 3.0 1.0 0.0 1.0 1.0 1.0 4.0 3.0 2.0 NaN NaN
65681 98865682 Male Loyal Customer 34.0 Personal Travel Eco 1816 0.0 0.0 1 0.0 Green Car 1.0 0.0 3.0 4.0 0.0 4.0 4.0 1.0 1.0 2.0 3.0 NaN NaN
79256 98879257 Male Loyal Customer 24.0 Personal Travel Eco 1826 0.0 0.0 1 0.0 Ordinary NaN 0.0 3.0 3.0 0.0 2.0 3.0 2.0 2.0 3.0 4.0 NaN NaN
88733 98888734 Female Loyal Customer 29.0 Personal Travel Eco 1918 16.0 5.0 1 0.0 Green Car 1.0 0.0 3.0 5.0 0.0 5.0 5.0 1.0 3.0 3.0 3.0 NaN NaN

Online Boarding¶

In [ ]:
labeled_countplot(train,'Online_Boarding', perc = True, order = True)
Number of null values:  6
In [ ]:
train.loc[train['Online_Boarding'].isnull()]
Out[ ]:
ID Gender Customer_Type Age Type_Travel Travel_Class Travel_Distance Departure_Delay_in_Mins Arrival_Delay_in_Mins Overall_Experience Seat_Comfort Seat_Class Arrival_Time_Convenient Catering Platform_Location Onboard_Wifi_Service Onboard_Entertainment Online_Support Ease_of_Online_Booking Onboard_Service Legroom Baggage_Handling CheckIn_Service Cleanliness Online_Boarding
3210 98803211 Male Loyal Customer 24.0 Personal Travel Eco 1473 11.0 1.0 1 0.0 Green Car 1.0 0.0 3.0 5.0 0.0 5.0 5.0 1.0 1.0 4.0 4.0 NaN NaN
29045 98829046 Male Loyal Customer 37.0 Personal Travel Eco 1887 0.0 0.0 1 0.0 Green Car 1.0 0.0 3.0 1.0 0.0 1.0 1.0 2.0 5.0 3.0 1.0 NaN NaN
48087 98848088 Male Loyal Customer 42.0 Personal Travel Eco 2311 0.0 0.0 1 0.0 Green Car NaN 0.0 3.0 1.0 0.0 1.0 1.0 1.0 4.0 3.0 2.0 NaN NaN
65681 98865682 Male Loyal Customer 34.0 Personal Travel Eco 1816 0.0 0.0 1 0.0 Green Car 1.0 0.0 3.0 4.0 0.0 4.0 4.0 1.0 1.0 2.0 3.0 NaN NaN
79256 98879257 Male Loyal Customer 24.0 Personal Travel Eco 1826 0.0 0.0 1 0.0 Ordinary NaN 0.0 3.0 3.0 0.0 2.0 3.0 2.0 2.0 3.0 4.0 NaN NaN
88733 98888734 Female Loyal Customer 29.0 Personal Travel Eco 1918 16.0 5.0 1 0.0 Green Car 1.0 0.0 3.0 5.0 0.0 5.0 5.0 1.0 3.0 3.0 3.0 NaN NaN
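
The Cleanliness and Online_Boarding outputs above list the same six rows (indices 3210, 29045, 48087, 65681, 79256 and 88733), so the two columns appear to be missing together. A minimal sketch of how such overlap could be checked for any set of columns is shown below. `toy` is a made-up stand-in for `train`, and `co_missing_counts` is a hypothetical helper, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `train` (values are invented for illustration only).
toy = pd.DataFrame({
    "Cleanliness":     [4.0, np.nan, 3.0, np.nan, 5.0],
    "Online_Boarding": [4.0, np.nan, 2.0, np.nan, 1.0],
    "CheckIn_Service": [np.nan, 4.0, 3.0, 1.0, 2.0],
})

def co_missing_counts(df, cols):
    """Return a cols x cols table counting rows where both columns are null."""
    mask = df[cols].isnull().astype(int)
    # Diagonal: nulls per column. Off-diagonal: rows where both columns are null.
    return mask.T.dot(mask)

overlap = co_missing_counts(toy, ["Cleanliness", "Online_Boarding", "CheckIn_Service"])
print(overlap)
```

On the real data, an off-diagonal count equal to the diagonal count for two columns would indicate that they are always missing in the same rows.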

Features with ~10% Missing Values¶
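
One way to see which columns belong in this group is to compute the share of null values per column. The sketch below is only an illustration: `toy` is an invented frame standing in for `train`, and `missing_share` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `train` (values are invented for illustration only).
toy = pd.DataFrame({
    "Cleanliness":   [4.0, 5.0, 3.0, 2.0],
    "Catering":      [np.nan, 3.0, 4.0, 1.0],
    "Customer_Type": ["Loyal Customer", np.nan, np.nan, "Disloyal Customer"],
})

def missing_share(df):
    """Percentage of null values per column, largest first."""
    return df.isnull().mean().mul(100).sort_values(ascending=False)

shares = missing_share(toy)
print(shares)
```

Calling `missing_share(train)` would then list every column with its percentage of nulls, which makes it easy to separate the nearly complete columns from those missing around a tenth of their values.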

Arrival Time Convenient¶

In [ ]:
labeled_countplot(train,'Arrival_Time_Convenient', perc = True, order = True)
Number of null values:  8930
In [ ]:
missing_time = train.loc[train['Arrival_Time_Convenient'].isnull()]
missing_time
Out[ ]:
ID Gender Customer_Type Age Type_Travel Travel_Class Travel_Distance Departure_Delay_in_Mins Arrival_Delay_in_Mins Overall_Experience Seat_Comfort Seat_Class Arrival_Time_Convenient Catering Platform_Location Onboard_Wifi_Service Onboard_Entertainment Online_Support Ease_of_Online_Booking Onboard_Service Legroom Baggage_Handling CheckIn_Service Cleanliness Online_Boarding
7 98800008 Male Loyal Customer 65.0 Personal Travel Business 853 0.0 3.0 0 3.0 Green Car NaN 3.0 1.0 5.0 5.0 4.0 4.0 4.0 3.0 4.0 4.0 4.0 5.0
12 98800013 Male Loyal Customer 44.0 NaN Business 427 0.0 0.0 1 3.0 Ordinary NaN 5.0 3.0 2.0 4.0 5.0 5.0 5.0 5.0 5.0 4.0 5.0 4.0
16 98800017 Female Disloyal Customer 9.0 Business Travel Eco 2064 14.0 1.0 0 2.0 Ordinary NaN 2.0 3.0 4.0 2.0 4.0 4.0 3.0 1.0 2.0 4.0 2.0 4.0
29 98800030 Male Loyal Customer 54.0 Business Travel Business 1596 0.0 0.0 1 2.0 Green Car NaN 2.0 2.0 3.0 4.0 4.0 4.0 4.0 4.0 4.0 5.0 4.0 4.0
33 98800034 Male Disloyal Customer 22.0 Business Travel Business 2515 42.0 30.0 1 5.0 Ordinary NaN 5.0 2.0 1.0 5.0 1.0 1.0 3.0 2.0 2.0 1.0 4.0 1.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
94318 98894319 Male Disloyal Customer 25.0 Business Travel Eco 1806 21.0 23.0 0 3.0 Ordinary NaN 3.0 1.0 5.0 3.0 4.0 5.0 3.0 3.0 3.0 2.0 4.0 5.0
94322 98894323 Male Loyal Customer 63.0 Personal Travel Eco 1645 60.0 58.0 0 3.0 Ordinary NaN 0.0 1.0 1.0 0.0 1.0 1.0 5.0 4.0 5.0 4.0 5.0 1.0
94329 98894330 Male Disloyal Customer 28.0 Business Travel Eco 2035 0.0 8.0 0 2.0 Green Car NaN 2.0 2.0 1.0 2.0 1.0 1.0 1.0 4.0 2.0 1.0 3.0 1.0
94345 98894346 Male Loyal Customer 29.0 Business Travel Business 3638 1.0 0.0 0 1.0 Ordinary NaN 5.0 5.0 1.0 1.0 1.0 1.0 2.0 1.0 3.0 2.0 3.0 1.0
94377 98894378 Male Loyal Customer 16.0 Personal Travel Eco 2744 0.0 0.0 0 2.0 Ordinary NaN 2.0 4.0 4.0 2.0 4.0 4.0 3.0 4.0 4.0 4.0 5.0 4.0

8930 rows × 25 columns

In [ ]:
# Check whether platform location is related to missing Arrival_Time_Convenient ratings
labeled_countplot(missing_time,'Platform_Location', perc = True, order = True)
Number of null values:  19
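
The count plot above only shows Platform_Location for the rows where Arrival_Time_Convenient is missing. To judge whether that distribution is unusual, it helps to compare it with the rows where the rating is present. A minimal sketch of such a comparison follows, using an invented `toy` frame in place of `train` and a hypothetical helper `share_by_missingness`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `train` (values are invented for illustration only).
toy = pd.DataFrame({
    "Arrival_Time_Convenient": [np.nan, 3.0, np.nan, 5.0, 2.0, np.nan],
    "Platform_Location":       [3.0, 3.0, 1.0, 2.0, 3.0, 3.0],
})

def share_by_missingness(df, target, feature):
    """Distribution of `feature` (each row sums to 1), split by whether `target` is null."""
    status = df[target].isnull().map({True: "missing", False: "present"})
    status = status.rename(f"{target}_status")
    return pd.crosstab(status, df[feature], normalize="index")

shares = share_by_missingness(toy, "Arrival_Time_Convenient", "Platform_Location")
print(shares.round(2))
```

If the "missing" and "present" rows look alike on the real data, platform location is unlikely to explain the missing ratings.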

Onboard Service¶

In [ ]:
labeled_countplot(train,'Onboard_Service', perc = True, order = True)
Number of null values:  7601
In [ ]:
train.loc[train['Onboard_Service'].isnull()]
Out[ ]:
ID Gender Customer_Type Age Type_Travel Travel_Class Travel_Distance Departure_Delay_in_Mins Arrival_Delay_in_Mins Overall_Experience Seat_Comfort Seat_Class Arrival_Time_Convenient Catering Platform_Location Onboard_Wifi_Service Onboard_Entertainment Online_Support Ease_of_Online_Booking Onboard_Service Legroom Baggage_Handling CheckIn_Service Cleanliness Online_Boarding
30 98800031 Male Loyal Customer 9.0 NaN Eco 2379 100.0 93.0 0 3.0 Green Car 3.0 3.0 3.0 4.0 3.0 5.0 4.0 NaN 4.0 3.0 2.0 4.0 4.0
51 98800052 Female Loyal Customer 26.0 Business Travel Business 4560 0.0 7.0 0 1.0 Ordinary 0.0 0.0 3.0 1.0 0.0 1.0 1.0 NaN 1.0 4.0 3.0 4.0 1.0
69 98800070 Female Loyal Customer 56.0 Personal Travel Eco 284 57.0 52.0 1 3.0 Green Car 3.0 3.0 3.0 5.0 5.0 4.0 5.0 NaN 5.0 5.0 5.0 5.0 3.0
76 98800077 Female Loyal Customer 42.0 NaN Eco 470 2.0 23.0 1 0.0 Green Car 1.0 0.0 2.0 3.0 2.0 2.0 3.0 NaN 0.0 3.0 1.0 3.0 4.0
88 98800089 Male Disloyal Customer 46.0 Personal Travel Eco 1708 0.0 0.0 0 2.0 Ordinary 4.0 2.0 3.0 2.0 2.0 2.0 2.0 NaN 5.0 5.0 3.0 5.0 2.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
94320 98894321 Male Loyal Customer 39.0 Personal Travel Eco 2112 0.0 9.0 0 3.0 Green Car 5.0 3.0 3.0 3.0 3.0 4.0 3.0 NaN 4.0 4.0 4.0 5.0 3.0
94330 98894331 Female Loyal Customer 32.0 Business Travel Business 3058 5.0 0.0 1 4.0 Green Car 4.0 4.0 4.0 5.0 4.0 5.0 5.0 NaN 5.0 5.0 3.0 5.0 5.0
94348 98894349 Female Loyal Customer 14.0 Personal Travel Eco 2727 2.0 2.0 1 4.0 Ordinary 4.0 4.0 4.0 4.0 4.0 3.0 4.0 NaN 4.0 4.0 2.0 4.0 2.0
94351 98894352 Female Loyal Customer 56.0 Business Travel Business 3325 0.0 0.0 1 0.0 Green Car 0.0 0.0 2.0 5.0 5.0 5.0 5.0 NaN 5.0 5.0 3.0 5.0 5.0
94358 98894359 Female Loyal Customer 31.0 Business Travel Business 2835 4.0 1.0 1 4.0 Ordinary 4.0 4.0 4.0 4.0 4.0 4.0 4.0 NaN 1.0 2.0 2.0 3.0 4.0

7601 rows × 25 columns

Catering¶

In [ ]:
labeled_countplot(train,'Catering', perc = True, order = True)
Number of null values:  8741
In [ ]:
train.loc[train['Catering'].isnull()]
Out[ ]:
ID Gender Customer_Type Age Type_Travel Travel_Class Travel_Distance Departure_Delay_in_Mins Arrival_Delay_in_Mins Overall_Experience Seat_Comfort Seat_Class Arrival_Time_Convenient Catering Platform_Location Onboard_Wifi_Service Onboard_Entertainment Online_Support Ease_of_Online_Booking Onboard_Service Legroom Baggage_Handling CheckIn_Service Cleanliness Online_Boarding
3 98800004 Female Loyal Customer 44.0 Business Travel Business 780 13.0 18.0 0 3.0 Ordinary 2.0 NaN 2.0 3.0 2.0 3.0 3.0 3.0 3.0 3.0 4.0 3.0 3.0
31 98800032 Male Loyal Customer 76.0 Business Travel Business 285 0.0 0.0 1 0.0 Ordinary 0.0 NaN 3.0 3.0 4.0 3.0 2.0 2.0 2.0 2.0 4.0 2.0 1.0
40 98800041 Female Disloyal Customer 44.0 Business Travel Business 1388 11.0 0.0 0 3.0 Green Car 3.0 NaN 2.0 1.0 3.0 1.0 1.0 3.0 3.0 4.0 5.0 4.0 1.0
41 98800042 Female Loyal Customer 43.0 Business Travel Business 1232 0.0 0.0 1 2.0 Ordinary 2.0 NaN 2.0 4.0 5.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 5.0
47 98800048 Female Loyal Customer 61.0 Business Travel Eco 61 62.0 64.0 1 5.0 Ordinary 5.0 NaN 5.0 2.0 5.0 4.0 5.0 5.0 5.0 5.0 5.0 5.0 1.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
94327 98894328 Female Disloyal Customer 22.0 Business Travel Eco 1808 0.0 0.0 1 5.0 Ordinary 5.0 NaN 2.0 1.0 5.0 1.0 1.0 5.0 4.0 4.0 4.0 5.0 1.0
94332 98894333 Female Disloyal Customer 22.0 Business Travel Eco 2105 0.0 0.0 0 4.0 Ordinary 4.0 NaN 4.0 3.0 4.0 3.0 3.0 5.0 4.0 5.0 4.0 5.0 3.0
94335 98894336 Female Loyal Customer 65.0 Business Travel Business 3183 0.0 0.0 0 2.0 Green Car 1.0 NaN 1.0 2.0 3.0 4.0 2.0 2.0 2.0 2.0 3.0 2.0 3.0
94343 98894344 Male Loyal Customer 50.0 Personal Travel Eco 2306 0.0 0.0 0 2.0 Green Car 4.0 NaN 1.0 1.0 2.0 1.0 1.0 1.0 5.0 3.0 5.0 4.0 1.0
94361 98894362 Female Loyal Customer 41.0 Business Travel Business 1998 0.0 0.0 1 5.0 Ordinary 5.0 NaN 5.0 3.0 4.0 4.0 3.0 3.0 3.0 3.0 3.0 3.0 5.0

8741 rows × 25 columns

Customer type¶

In [ ]:
labeled_countplot(train,'Customer_Type', perc = True, order = False)
Number of null values:  8951
In [ ]:
train.loc[train['Customer_Type'].isnull()]
Out[ ]:
ID Gender Customer_Type Age Type_Travel Travel_Class Travel_Distance Departure_Delay_in_Mins Arrival_Delay_in_Mins Overall_Experience Seat_Comfort Seat_Class Arrival_Time_Convenient Catering Platform_Location Onboard_Wifi_Service Onboard_Entertainment Online_Support Ease_of_Online_Booking Onboard_Service Legroom Baggage_Handling CheckIn_Service Cleanliness Online_Boarding
32 98800033 Male NaN 30.0 Business Travel Business 3357 11.0 0.0 1 2.0 Green Car 2.0 2.0 2.0 5.0 5.0 5.0 5.0 4.0 4.0 4.0 5.0 5.0 5.0
34 98800035 Male NaN 41.0 Business Travel Business 1724 8.0 20.0 1 2.0 Ordinary 2.0 2.0 2.0 2.0 4.0 4.0 4.0 4.0 4.0 4.0 3.0 4.0 4.0
38 98800039 Male NaN 46.0 Personal Travel Eco 2608 0.0 0.0 0 3.0 Ordinary 4.0 3.0 4.0 1.0 3.0 1.0 1.0 3.0 3.0 5.0 4.0 5.0 1.0
49 98800050 Female NaN 25.0 Business Travel Business 1784 0.0 0.0 1 4.0 Ordinary 4.0 NaN 3.0 2.0 4.0 2.0 2.0 5.0 5.0 4.0 4.0 5.0 2.0
56 98800057 Female NaN 39.0 Business Travel Business 2016 0.0 14.0 1 4.0 Ordinary 4.0 3.0 4.0 4.0 5.0 4.0 5.0 5.0 5.0 4.0 4.0 5.0 4.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
94337 98894338 Female NaN 31.0 Business Travel Business 2480 22.0 0.0 0 3.0 Green Car 3.0 3.0 4.0 4.0 3.0 4.0 4.0 4.0 4.0 3.0 3.0 4.0 4.0
94340 98894341 Female NaN 34.0 Personal Travel Eco 1754 0.0 0.0 1 2.0 Ordinary 2.0 2.0 2.0 5.0 5.0 4.0 5.0 5.0 5.0 5.0 4.0 5.0 3.0
94342 98894343 Female NaN 37.0 Business Travel Business 1623 0.0 7.0 1 3.0 Green Car 3.0 3.0 3.0 4.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0
94344 98894345 Female NaN 49.0 Business Travel Business 272 2.0 0.0 1 2.0 Green Car 2.0 2.0 2.0 4.0 4.0 4.0 5.0 5.0 5.0 5.0 3.0 5.0 3.0
94376 98894377 Male NaN 63.0 Business Travel Business 2794 0.0 0.0 1 2.0 Green Car 2.0 2.0 2.0 4.0 5.0 4.0 4.0 4.0 4.0 4.0 3.0 4.0 3.0

8951 rows × 25 columns

Observations:

  1. About 74% of all passengers are loyal customers and 16.5% are disloyal customers; the remaining 9.5% have no recorded customer type.
  2. There are 8951 null values, a fairly large share of the data.
  3. We could not find any obvious pattern that explains the missing values.
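
One way to look for a pattern behind the missing values is to compare the share of missing Customer_Type entries across the levels of another column. The sketch below does this on a small made-up frame; the helper name `missing_rate_by` and the toy values are assumptions for illustration and are not part of the notebook.

```python
import pandas as pd

def missing_rate_by(df, target_col, group_col):
    """Share of rows where target_col is null, per level of group_col."""
    return df[target_col].isna().groupby(df[group_col]).mean()

# Toy data, made up for illustration only
toy = pd.DataFrame({
    "Customer_Type": ["Loyal Customer", None, "Disloyal Customer",
                      None, "Loyal Customer", "Loyal Customer"],
    "Travel_Class": ["Business", "Business", "Eco", "Eco", "Eco", "Business"],
})

rates = missing_rate_by(toy, "Customer_Type", "Travel_Class")
print(rates)
```

On the real data the call would be `missing_rate_by(train, 'Customer_Type', 'Travel_Class')`. Similar rates across groups would be consistent with the observation that no obvious pattern exists.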

Travel Type¶

In [ ]:
labeled_countplot(train,'Type_Travel', perc = True, order = False)
Number of null values:  9226
In [ ]:
train.loc[train['Type_Travel'].isnull() == True]
Out[ ]:
ID Gender Customer_Type Age Type_Travel Travel_Class Travel_Distance Departure_Delay_in_Mins Arrival_Delay_in_Mins Overall_Experience Seat_Comfort Seat_Class Arrival_Time_Convenient Catering Platform_Location Onboard_Wifi_Service Onboard_Entertainment Online_Support Ease_of_Online_Booking Onboard_Service Legroom Baggage_Handling CheckIn_Service Cleanliness Online_Boarding
0 98800001 Female Loyal Customer 52.0 NaN Business 272 0.0 5.0 0 2.0 Green Car 5.0 5.0 5.0 4.0 2.0 3.0 2.0 2.0 3.0 2.0 4.0 2.0 1.0
12 98800013 Male Loyal Customer 44.0 NaN Business 427 0.0 0.0 1 3.0 Ordinary NaN 5.0 3.0 2.0 4.0 5.0 5.0 5.0 5.0 5.0 4.0 5.0 4.0
15 98800016 Female Loyal Customer 54.0 NaN Business 2827 0.0 0.0 1 5.0 Ordinary 5.0 5.0 5.0 2.0 5.0 4.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
30 98800031 Male Loyal Customer 9.0 NaN Eco 2379 100.0 93.0 0 3.0 Green Car 3.0 3.0 3.0 4.0 3.0 5.0 4.0 NaN 4.0 3.0 2.0 4.0 4.0
39 98800040 Male Loyal Customer 35.0 NaN Eco 1818 18.0 2.0 0 4.0 Ordinary 1.0 3.0 2.0 1.0 3.0 1.0 1.0 4.0 1.0 2.0 4.0 3.0 1.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
94273 98894274 Male Loyal Customer 68.0 NaN Eco 1871 18.0 0.0 0 2.0 Ordinary 4.0 2.0 4.0 5.0 2.0 5.0 5.0 3.0 1.0 4.0 3.0 3.0 5.0
94288 98894289 Female Loyal Customer 11.0 NaN Eco 1789 22.0 18.0 0 4.0 Ordinary 2.0 NaN 3.0 5.0 4.0 5.0 5.0 1.0 1.0 4.0 4.0 4.0 5.0
94307 98894308 Female Loyal Customer 18.0 NaN Business 1911 0.0 6.0 1 5.0 Ordinary 5.0 5.0 5.0 4.0 4.0 4.0 4.0 4.0 2.0 5.0 5.0 5.0 4.0
94328 98894329 Female Loyal Customer 52.0 NaN Eco 2160 211.0 213.0 0 2.0 Green Car 5.0 2.0 4.0 2.0 3.0 3.0 5.0 4.0 5.0 5.0 3.0 3.0 3.0
94378 98894379 Male Loyal Customer 54.0 NaN Eco 2107 28.0 28.0 0 3.0 Ordinary 1.0 3.0 3.0 3.0 3.0 3.0 3.0 1.0 4.0 4.0 1.0 4.0 3.0

9226 rows × 25 columns

Observations:

Multivariate Analysis¶

Numerical Variables¶

In [ ]:
#Retrieving the names of numerical columns and categorical columns
num_cols2 = train.select_dtypes(include='number').columns  # public API instead of the private _get_numeric_data()
cat_cols2 = train.select_dtypes(exclude='number').columns
In [ ]:
# Plotting the correlation between all numerical variables
# (restricting to numeric columns, because corr() fails on text columns in recent pandas versions)
plt.figure(figsize=(15,8))
sns.heatmap(train[num_cols2].corr())
# Plotting an annotated correlation heatmap for the rating and travel variables
plt.figure(figsize=(15,8))
sns.heatmap(train[['Seat_Comfort',
       'Arrival_Time_Convenient', 'Catering', 'Platform_Location',
       'Onboard_Wifi_Service', 'Onboard_Entertainment', 'Online_Support',
       'Ease_of_Online_Booking', 'Onboard_Service', 'Legroom',
       'Baggage_Handling', 'CheckIn_Service', 'Cleanliness',
       'Online_Boarding','Age','Travel_Distance','Departure_Delay_in_Mins','Arrival_Delay_in_Mins']].corr(),annot=True, fmt='0.2f', cmap='YlGnBu')
Out[ ]:
<Axes: >
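
A heatmap with this many variables can be hard to read, so it may help to list the strongest pairwise correlations as a table. The sketch below is one possible way to do that; the helper `top_correlations` and the synthetic columns `a`, `b`, `c` are assumptions made for the example.

```python
import numpy as np
import pandas as pd

def top_correlations(df, n=5):
    """Return the n column pairs with the largest absolute Pearson correlation."""
    corr = df.corr()
    # Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)

# Synthetic data: b is a noisy copy of a, c is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

top = top_correlations(toy, n=2)
print(top)
```

On the notebook's data, passing the same column selection used for the annotated heatmap would give the corresponding ranking.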

Do business passengers always take business class?¶

In [ ]:
sns.countplot(train,x = 'Type_Travel',hue = 'Travel_Class')
Out[ ]:
<Axes: xlabel='Type_Travel', ylabel='count'>

Relationship between seat class, legroom and seat comfort?¶

Seeing if Seat_Class affects rating distribution of Legroom and Seat_Comfort¶

In [ ]:
sns.histplot(train, x = 'Legroom', hue = 'Seat_Class',discrete = True)
Out[ ]:
<Axes: xlabel='Legroom', ylabel='Count'>
In [ ]:
sns.histplot(train, x = 'Seat_Comfort', hue = 'Seat_Class', discrete = True)
Out[ ]:
<Axes: xlabel='Seat_Comfort', ylabel='Count'>

Observations:

  1. Green Car seats consistently receive better ratings than Ordinary seats.
  2. Ordinary seats consistently receive more poor ratings than Green Car seats.

How do they affect overall experience?¶

In [ ]:
stacked_barplot(train,'Seat_Class','Overall_Experience')
Overall_Experience      0      1    All
Seat_Class                             
All                 42786  51593  94379
Green Car           21434  26001  47435
Ordinary            21352  25592  46944
------------------------------------------------------------------------------------------------------------------------
In [ ]:
stacked_barplot(train,'Seat_Comfort','Overall_Experience')
Overall_Experience      0      1    All
Seat_Comfort                           
All                 42757  51561  94318
3.0                 13669   7489  21158
2.0                 13464   7482  20946
1.0                  8339   6846  15185
4.0                  7181  13414  20595
5.0                    96  12875  12971
0.0                     8   3455   3463
------------------------------------------------------------------------------------------------------------------------
In [ ]:
stacked_barplot(train,'Legroom','Overall_Experience')
Overall_Experience      0      1    All
Legroom                                
All                 42750  51539  94289
3.0                 10321   6063  16384
2.0                  9814   5939  15753
4.0                  9488  19382  28870
5.0                  7245  17587  24832
1.0                  5776   2334   8110
0.0                   106    234    340
------------------------------------------------------------------------------------------------------------------------
In [ ]:
sns.countplot(train, x = 'Gender', hue = 'Overall_Experience')
Out[ ]:
<Axes: xlabel='Gender', ylabel='count'>

Women tend to rate the experience more favourably than men: around 65% of women rate the travel as a good experience, compared with around 45% of men.

Observations:

  1. Whether the seat class is Green Car or Ordinary does not seem to make a significant difference. Customers in the Green Car are only slightly more satisfied overall (54.8% versus 54.5%).
  2. This may mean that Seat_Class is not a very good indicator of overall experience.
  3. Of the 3463 passengers who rated seat comfort as extremely poor (rating 0), 3455 were nevertheless satisfied with the overall experience.
  4. Passengers who rated seat comfort as poor or below were satisfied overall more often than those who rated it as acceptable or needing improvement.
  5. This is counter-intuitive; perhaps Seat_Comfort is not a good indicator of overall experience.
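
The ratios behind observations 3 and 4 can be checked directly from the Seat_Comfort counts printed above. The snippet below is a small worked calculation; the variable names are illustrative.

```python
# (not satisfied, satisfied) counts per Seat_Comfort rating, copied from the output above
counts = {
    0: (8, 3455),
    1: (8339, 6846),
    2: (13464, 7482),
    3: (13669, 7489),
    4: (7181, 13414),
    5: (96, 12875),
}

# Share of satisfied passengers within each rating
satisfaction_rate = {
    rating: sat / (not_sat + sat) for rating, (not_sat, sat) in counts.items()
}

for rating, rate in satisfaction_rate.items():
    print(f"Seat_Comfort {rating}: {rate:.1%} satisfied")
```

On the dataframe itself, `pd.crosstab(train['Seat_Comfort'], train['Overall_Experience'], normalize='index')` should give the same row-wise shares.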

Missing value and outliers treatment¶

Gender¶

Women tend to rate the experience more favourably than men: around 65% of women rate the travel as a good experience, compared with around 45% of men. We know the gender of 99.92% of the passengers; only 77 values are missing for this feature. Missing values of categorical variables are generally imputed with the mode.

In [ ]:
#substitute missing values with the mode
train['Gender'] = train['Gender'].fillna('Female')

Customer_Type¶

Most customers are loyal: among passengers with a known customer type, 81.7% are loyal and 18.3% are disloyal. Customer_Type is missing for 9.48% of passengers. That is not a huge share, but we have to impute carefully because the imputation might distort the distribution. Since the gap between loyal and disloyal customers is large, imputing the mode is a reasonable choice. To check how much the distribution changes, we compute the percentage of loyal and disloyal customers before and after the imputation.

In [ ]:
#check the distribution before the imputation
train['Customer_Type'].value_counts(normalize=True)
Out[ ]:
Loyal Customer       0.817332
Disloyal Customer    0.182668
Name: Customer_Type, dtype: float64
In [ ]:
stacked_barplot(train,'Customer_Type','Overall_Experience')
Overall_Experience      0      1    All
Customer_Type                          
All                 38663  46765  85428
Loyal Customer      26794  43029  69823
Disloyal Customer   11869   3736  15605
------------------------------------------------------------------------------------------------------------------------
In [ ]:
stacked_barplot(train,'Customer_Type','Onboard_Entertainment')
Onboard_Entertainment   0.0   1.0    2.0    3.0    4.0    5.0    All
Customer_Type                                                       
All                    1957  7751  12640  15858  27587  19618  85411
Loyal Customer         1290  5336   8779  11933  24480  17990  69808
Disloyal Customer       667  2415   3861   3925   3107   1628  15603
------------------------------------------------------------------------------------------------------------------------
In [ ]:
#substitute missing values with the mode
train['Customer_Type'] = train['Customer_Type'].fillna('Loyal Customer')
In [ ]:
#check how the distribution change after the imputation
train['Customer_Type'].value_counts(normalize=True)
Out[ ]:
Loyal Customer       0.834656
Disloyal Customer    0.165344
Name: Customer_Type, dtype: float64

After imputation, 83.5% of passengers are Loyal Customers and 16.5% are Disloyal Customers. The shift is about 1.7 percentage points, which is not a large difference compared with before.
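
The size of this shift follows directly from the counts shown earlier, since every missing value is assigned to the loyal class. A short worked check (variable names are illustrative):

```python
# Counts taken from the outputs above
loyal, disloyal, missing = 69823, 15605, 8951
total = loyal + disloyal + missing

share_before = loyal / (loyal + disloyal)   # loyal share among known values
share_after = (loyal + missing) / total     # loyal share after mode imputation
shift = share_after - share_before

print(f"before: {share_before:.4f}, after: {share_after:.4f}, shift: {shift:.4f}")
```

The larger the share of missing values, the more mode imputation inflates the majority class, which is why the same check is worth repeating for other categorical columns.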

Type_Travel¶

Most passengers travel for business: among passengers with a known Type_Travel, 68.8% are on business travel and 31.2% on personal travel. There are 9226 missing values, 9.77% of the total. The gap between the two categories is large, so we can substitute the missing values with the mode.

In [ ]:
train['Type_Travel'].value_counts(normalize = True)
Out[ ]:
Business Travel    0.688373
Personal Travel    0.311627
Name: Type_Travel, dtype: float64
In [ ]:
stacked_barplot(train,'Type_Travel','Overall_Experience')
Overall_Experience      0      1    All
Type_Travel                            
All                 38600  46553  85153
Business Travel     24441  34176  58617
Personal Travel     14159  12377  26536
------------------------------------------------------------------------------------------------------------------------
In [ ]:
#substitute missing values with the mode
train['Type_Travel'] = train['Type_Travel'].fillna('Business Travel')

Numerical variables range 0-5¶

Variables rated on a 0 to 5 scale are bounded, so they cannot contain extreme outliers. The choice of imputation value therefore depends only on the shape of the distribution, which we check with count plots before deciding. Unless the distribution is highly skewed, we use the mean to impute missing values. Otherwise, when one rating clearly dominates, we use the mode.

In [ ]:
#create one count plot per appreciation variable
fig, axes = plt.subplots(14, 1, figsize = (20, 75))
fig.suptitle('Count plots for appreciation variables')

#Add Platform_Location into appreciation variables
appreciation_variables2 = appreciation_variables +['Platform_Location']
print(appreciation_variables2)

for i,column in enumerate(appreciation_variables2):
    sns.countplot(data=train, x=column, ax = axes[i]);
['Seat_Comfort', 'Arrival_Time_Convenient', 'Catering', 'Onboard_Wifi_Service', 'Onboard_Entertainment', 'Online_Support', 'Ease_of_Online_Booking', 'Onboard_Service', 'Legroom', 'Baggage_Handling', 'CheckIn_Service', 'Cleanliness', 'Online_Boarding', 'Platform_Location']

Seat comfort ratings are concentrated around 2, 3 and 4, with fewer observations for 5, 1 and 0. As the count plot shows, the observations are spread fairly evenly, so we can use the mean. The same logic applies to catering, arrival time convenience, platform location, onboard Wi-Fi service, check-in service and online boarding.

In [ ]:
# finding the mean and imputing the missing data with it
for column in ['Seat_Comfort','Catering', 'Arrival_Time_Convenient', 'Platform_Location','Onboard_Wifi_Service', 'CheckIn_Service', 'Online_Boarding']:
    train[column] = train[column].fillna(train[column].mean())

In the count plots of onboard entertainment, online support, ease of online booking, onboard service, legroom, baggage handling and cleanliness, the mode is much more frequent than the other ratings, so it is more appropriate to impute with the mode.

In [ ]:
# finding the mode and imputing the missing data with it
for column in ['Onboard_Entertainment', 'Online_Support', 'Ease_of_Online_Booking', 'Onboard_Service', 'Legroom', 'Baggage_Handling', 'Cleanliness']:
    train[column] = train[column].fillna(train[column].mode()[0])
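
One practical difference between the two strategies is that the mean of a 0 to 5 rating is usually not a whole number, so mean-imputed rows end up with a value that no passenger could have selected, while the mode always stays on the scale. The toy series below illustrates this; the values are made up for the example.

```python
import pandas as pd

# Made-up ratings on the 0-5 scale with two missing entries
ratings = pd.Series([4, 4, 5, 3, None, 4, 2, None], name="Legroom", dtype="float64")

mean_filled = ratings.fillna(ratings.mean())      # fills with 22 / 6, about 3.67
mode_filled = ratings.fillna(ratings.mode()[0])   # fills with the most frequent rating

print(mean_filled.tolist())
print(mode_filled.tolist())
```

Tree-based models can handle the fractional values, but if the ratings are later treated as categories, a mode or rounded-mean fill may be the safer choice.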

Other numerical variables¶

For the other numerical variables we have to choose between the median and the mean to impute missing values. When there are outliers that could distort the mean, we impute with the median instead. Age doesn't have outliers, so we can impute its missing values with the mean.

In [ ]:
#substitute missing values with mean
train['Age'] = train['Age'].fillna(train['Age'].mean())

Departure and arrival delays are highly right-skewed, with a large number of outliers that strongly affect the mean, so the median is a better imputation value than the mean. However, the two variables are good estimates of one another: on average they differ by only 0.43 minutes. When only one of the two is missing, we therefore fill it with the other; when both are missing, we impute each with its median.

In [ ]:
train['diff'] = train['Arrival_Delay_in_Mins'] - train['Departure_Delay_in_Mins']
In [ ]:
train['diff'].mean()
Out[ ]:
0.43188828146603986
In [ ]:
train.drop('diff',axis=1,inplace=True)
In [ ]:
train.head(1)
Out[ ]:
ID Gender Customer_Type Age Type_Travel Travel_Class Travel_Distance Departure_Delay_in_Mins Arrival_Delay_in_Mins Overall_Experience Seat_Comfort Seat_Class Arrival_Time_Convenient Catering Platform_Location Onboard_Wifi_Service Onboard_Entertainment Online_Support Ease_of_Online_Booking Onboard_Service Legroom Baggage_Handling CheckIn_Service Cleanliness Online_Boarding
0 98800001 Female Loyal Customer 52.0 Business Travel Business 272 0.0 5.0 0 2.0 Green Car 5.0 5.0 5.0 4.0 2.0 3.0 2.0 2.0 3.0 2.0 4.0 2.0 1.0
In [ ]:
#substitute arrival delay with departure delay and vice versa;
#if both are missing, fall back to the median of each column
#(vectorised, which avoids a slow row-by-row loop over the dataframe)
arrival = train['Arrival_Delay_in_Mins'].copy()
departure = train['Departure_Delay_in_Mins'].copy()
train['Arrival_Delay_in_Mins'] = arrival.fillna(departure).fillna(arrival.median())
train['Departure_Delay_in_Mins'] = departure.fillna(arrival).fillna(departure.median())

We also need to treat the outliers of departure delay and arrival delay. We cap them at the whiskers of the box plot: values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR are replaced by the respective bound.

In [ ]:
def treat_outliers(df, col):
    """
    Treats outliers in a variable by clipping them to the IQR whiskers

    df: dataframe
    col: str, name of the numerical variable
    """
    Q1 = df[col].quantile(0.25)  # 25th percentile
    Q3 = df[col].quantile(0.75)  # 75th percentile
    IQR = Q3 - Q1                # Interquartile range (75th percentile - 25th percentile)
    lower_whisker = Q1 - 1.5 * IQR
    upper_whisker = Q3 + 1.5 * IQR

    # all the values smaller than lower_whisker will be assigned the value of lower_whisker
    # all the values greater than upper_whisker will be assigned the value of upper_whisker
    # the assignment will be done by using the clip function of NumPy
    df[col] = np.clip(df[col], lower_whisker, upper_whisker)

    return df
In [ ]:
data = treat_outliers(train,'Departure_Delay_in_Mins')
data = treat_outliers(train,'Arrival_Delay_in_Mins')
In [ ]:
histogram_boxplot(train,'Departure_Delay_in_Mins')
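
To make the clipping rule concrete, here is the same computation on a tiny made-up series of delays. The numbers are illustrative only and do not come from the dataset.

```python
import pandas as pd

# Made-up delays in minutes with one extreme value
delays = pd.Series([0, 0, 0, 5, 10, 200], dtype="float64")

q1 = delays.quantile(0.25)    # 0.0
q3 = delays.quantile(0.75)    # 8.75 (linear interpolation between 5 and 10)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Values outside [lower, upper] are replaced by the nearest bound
clipped = delays.clip(lower, upper)
print(lower, upper)
print(clipped.tolist())
```

Only the 200-minute delay is changed; it is pulled down to the upper whisker, and the remaining values are untouched. Because delays are heavily right-skewed, the upper whisker on the real data can be fairly low, so many genuinely long delays are capped at the same value.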

Data Pre-processing¶

Testing Datasets¶

The testing dataset does not have the Overall_Experience column, so it is used purely for prediction after we have built our model. Therefore, we only process it to make sure it has the same format as the training dataset.

In [ ]:
#Checking if the travel and survey test data have the same number of unique IDs
if traveldata_test['ID'].nunique()==surveydata_test['ID'].nunique():
    print(f"the unique ids are the same number")
    n_passengers = traveldata_test['ID'].nunique()
    print(f"there are {n_passengers} passengers in total")
the unique ids are the same number
there are 35602 passengers in total
In [ ]:
#merge dataframes
test = pd.merge(traveldata_test,surveydata_test,how='inner',on='ID')
if n_passengers == test['ID'].nunique():
    print('merge is succesfull, all passengers are in the final dataframe')
merge is succesfull, all passengers are in the final dataframe
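
Comparing the number of unique IDs confirms that the two files are the same size, but not that they contain the same passengers. A slightly stricter check, sketched below on two made-up frames, compares the ID sets and asks `pd.merge` to verify that the keys are unique on both sides through its `validate` argument. The frame names `travel` and `survey` are assumptions for the example.

```python
import pandas as pd

# Made-up stand-ins for the travel and survey files
travel = pd.DataFrame({"ID": [1, 2, 3], "Age": [30, 41, 25]})
survey = pd.DataFrame({"ID": [3, 1, 2], "Catering": ["Good", "Poor", "Excellent"]})

# True only if both files contain exactly the same passengers
same_ids = set(travel["ID"]) == set(survey["ID"])

# validate="one_to_one" raises a MergeError if an ID is duplicated in either frame
merged = pd.merge(travel, survey, how="inner", on="ID", validate="one_to_one")

print(same_ids, len(merged))
```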
In [ ]:
#Converting all features with satisfactory scales to numerical variables
for column in appreciation_variables:
    test[column] = test[column].apply(cat_to_numerical)

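#Converting Platform_Location to numerical variables
#(assigning the result back, because inplace=True on a selected column
# is not guaranteed to update the dataframe in newer pandas versions)
test['Platform_Location'] = test['Platform_Location'].replace({'Very Convenient': 5,
                                                               'Convenient': 4,
                                                               'Manageable': 3,
                                                               'Needs Improvement': 2,
                                                               'Inconvenient': 1,
                                                               'Very Inconvenient': 0})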
In [ ]:
#remove the ID column; we don't need it for the model
test.drop('ID',axis=1,inplace=True)
In [ ]:
# Creating dummy variables for the categorical columns
test_data = pd.get_dummies(test,
                      columns = data.select_dtypes(include = ["object", "category"]).columns.tolist(),
                      drop_first = True) #Only apply this function to object and category variables
In [ ]:
test_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 35602 entries, 0 to 35601
Data columns (total 23 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Age                           35591 non-null  float64
 1   Travel_Distance               35602 non-null  int64  
 2   Departure_Delay_in_Mins       35573 non-null  float64
 3   Arrival_Delay_in_Mins         35479 non-null  float64
 4   Seat_Comfort                  35580 non-null  float64
 5   Arrival_Time_Convenient       32277 non-null  float64
 6   Catering                      32245 non-null  float64
 7   Platform_Location             35590 non-null  float64
 8   Onboard_Wifi_Service          35590 non-null  float64
 9   Onboard_Entertainment         35594 non-null  float64
 10  Online_Support                35576 non-null  float64
 11  Ease_of_Online_Booking        35584 non-null  float64
 12  Onboard_Service               32730 non-null  float64
 13  Legroom                       35577 non-null  float64
 14  Baggage_Handling              35562 non-null  float64
 15  CheckIn_Service               35580 non-null  float64
 16  Cleanliness                   35600 non-null  float64
 17  Online_Boarding               35600 non-null  float64
 18  Gender_Male                   35602 non-null  uint8  
 19  Customer_Type_Loyal Customer  35602 non-null  uint8  
 20  Type_Travel_Personal Travel   35602 non-null  uint8  
 21  Travel_Class_Eco              35602 non-null  uint8  
 22  Seat_Class_Ordinary           35602 non-null  uint8  
dtypes: float64(17), int64(1), uint8(5)
memory usage: 5.3 MB
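
The summary above shows that the test frame still contains nulls in the numerical columns. In addition, `pd.get_dummies` encodes a missing categorical value as 0 in every dummy column, so with `drop_first = True` a missing Customer_Type is silently treated as the dropped category rather than as the mode used for the training data. One possible remedy, sketched below under the assumption that the test set should follow the same imputation rules as the training set, is to fill the test frame with statistics computed on the training frame before encoding. The helper `impute_like_train` and the toy frames are hypothetical.

```python
import pandas as pd

def impute_like_train(train_df, test_df, mean_cols=(), mode_cols=()):
    """Fill nulls in test_df using statistics computed on train_df only."""
    out = test_df.copy()
    for col in mean_cols:
        out[col] = out[col].fillna(train_df[col].mean())
    for col in mode_cols:
        out[col] = out[col].fillna(train_df[col].mode()[0])
    return out

# Toy frames, made up for illustration
train_toy = pd.DataFrame({
    "Age": [20.0, 40.0, 60.0],
    "Customer_Type": ["Loyal Customer", "Loyal Customer", "Disloyal Customer"],
})
test_toy = pd.DataFrame({
    "Age": [30.0, None],
    "Customer_Type": [None, "Disloyal Customer"],
})

filled = impute_like_train(train_toy, test_toy,
                           mean_cols=["Age"], mode_cols=["Customer_Type"])
print(filled)
```

Using training statistics rather than test statistics keeps the preprocessing identical for both sets and avoids leaking information from the data being predicted.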

Training Datasets¶

In [ ]:
#remove the ID column; we don't need it for the model
train.drop('ID',axis=1,inplace=True)
In [ ]:
# Creating dummy variables for the categorical columns
train_data = pd.get_dummies(train,
                      columns = data.select_dtypes(include = ["object", "category"]).columns.tolist(),
                      drop_first = True) #Only apply this function to object and category variables
In [ ]:
train_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 94379 entries, 0 to 94378
Data columns (total 24 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Age                           94379 non-null  float64
 1   Travel_Distance               94379 non-null  int64  
 2   Departure_Delay_in_Mins       94379 non-null  float64
 3   Arrival_Delay_in_Mins         94379 non-null  float64
 4   Overall_Experience            94379 non-null  int64  
 5   Seat_Comfort                  94379 non-null  float64
 6   Arrival_Time_Convenient       94379 non-null  float64
 7   Catering                      94379 non-null  float64
 8   Platform_Location             94379 non-null  float64
 9   Onboard_Wifi_Service          94379 non-null  float64
 10  Onboard_Entertainment         94379 non-null  float64
 11  Online_Support                94379 non-null  float64
 12  Ease_of_Online_Booking        94379 non-null  float64
 13  Onboard_Service               94379 non-null  float64
 14  Legroom                       94379 non-null  float64
 15  Baggage_Handling              94379 non-null  float64
 16  CheckIn_Service               94379 non-null  float64
 17  Cleanliness                   94379 non-null  float64
 18  Online_Boarding               94379 non-null  float64
 19  Gender_Male                   94379 non-null  uint8  
 20  Customer_Type_Loyal Customer  94379 non-null  uint8  
 21  Type_Travel_Personal Travel   94379 non-null  uint8  
 22  Travel_Class_Eco              94379 non-null  uint8  
 23  Seat_Class_Ordinary           94379 non-null  uint8  
dtypes: float64(17), int64(2), uint8(5)
memory usage: 16.9 MB
In [ ]:
#Saving the independent variables in x
x = train_data.drop('Overall_Experience',axis=1)

#Saving dependent variables in y
y = train_data['Overall_Experience']

# Y_train = train['Overall_Experience']
# X_train = train[['Seat_Comfort',
#        'Arrival_Time_Convenient', 'Catering', 'Platform_Location',
#        'Onboard_Wifi_Service', 'Onboard_Entertainment', 'Online_Support',
#        'Ease_of_Online_Booking', 'Onboard_Service', 'Legroom',
#        'Baggage_Handling', 'CheckIn_Service', 'Cleanliness',
#        'Online_Boarding','Gender','Customer_Type','Type_Travel','Travel_Class','Seat_Class','Age','Travel_Distance','Departure_Delay_in_Mins','Arrival_Delay_in_Mins']]

Split and Scale Data¶

In [ ]:
# Splitting the dataset into train and test datasets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, shuffle = True, random_state = 1)
In [ ]:
print("Shape of Training set : ", x_train.shape)
print("Shape of test set : ", x_test.shape)
Shape of Training set :  (75503, 23)
Shape of test set :  (18876, 23)
In [ ]:
x_train.head()
Out[ ]:
Age Travel_Distance Departure_Delay_in_Mins Arrival_Delay_in_Mins Seat_Comfort Arrival_Time_Convenient Catering Platform_Location Onboard_Wifi_Service Onboard_Entertainment Online_Support Ease_of_Online_Booking Onboard_Service Legroom Baggage_Handling CheckIn_Service Cleanliness Online_Boarding Gender_Male Customer_Type_Loyal Customer Type_Travel_Personal Travel Travel_Class_Eco Seat_Class_Ordinary
17535 35.0 1942 30.0 32.5 3.0 4.0 3.0 4.0 5.0 3.0 1.0 5.0 1.0 1.0 3.0 1.0 3.0 5.0 1 1 1 1 0
69574 16.0 1686 0.0 14.0 3.0 2.0 3.0 3.0 4.0 3.0 4.0 4.0 4.0 4.0 4.0 4.0 3.0 4.0 1 0 0 1 0
45312 28.0 1978 0.0 0.0 1.0 0.0 1.0 4.0 2.0 1.0 2.0 2.0 4.0 1.0 4.0 1.0 4.0 2.0 1 0 0 1 1
40906 54.0 460 0.0 0.0 2.0 2.0 2.0 2.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 3.0 0 1 0 0 0
66430 25.0 1495 0.0 4.0 5.0 4.0 5.0 5.0 2.0 5.0 2.0 2.0 4.0 3.0 5.0 3.0 5.0 2.0 0 0 0 0 0
In [ ]:
# Scaling the data
sc=StandardScaler()

# Fit_transform on train data
x_train_scaled=sc.fit_transform(x_train)
x_train_scaled=pd.DataFrame(x_train_scaled, columns=x.columns)

# Transform on test data
x_test_scaled=sc.transform(x_test)
x_test_scaled=pd.DataFrame(x_test_scaled, columns=x.columns)

Model Building¶

Evaluation Metric Functions¶

In [ ]:
# Creating metric function
def metrics_score(actual, predicted):
    print(classification_report(actual, predicted))

    cm = confusion_matrix(actual, predicted)
    plt.figure(figsize=(8,5))

    #In this heatmap, make sure the xticklabels are labelled correctly in the format [label if prediction is 0, label if prediction is 1]. In this case, 1 means Satisfied, and 0 means Not Satisfied.
    sns.heatmap(cm, annot=True,  fmt='.2f', xticklabels=['Not Satisfied', 'Satisfied'], yticklabels=['Not Satisfied', 'Satisfied'])
    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    plt.show()
In [ ]:
# Function to compute adjusted R-squared (Only for regression models, but in the end these metrics were not used)
def adj_r2_score(predictors, targets, predictions):
    r2 = r2_score(targets, predictions)
    n = predictors.shape[0]
    k = predictors.shape[1]
    return 1 - ((1 - r2) * (n - 1) / (n - k - 1))


# Function to compute MAPE
def mape_score(targets, predictions):
    return np.mean(np.abs(targets - predictions) / targets) * 100


# Function to compute different metrics to check performance of a regression model
def model_performance_regression(model, predictors, target):
    """
    Function to compute different metrics to check regression model performance

    model: regressor
    predictors: independent variables
    target: dependent variable
    """

    pred = model.predict(predictors)                  # Predict using the independent variables
    r2 = r2_score(target, pred)                       # To compute R-squared
    adjr2 = adj_r2_score(predictors, target, pred)    # To compute adjusted R-squared
    rmse = np.sqrt(mean_squared_error(target, pred))  # To compute RMSE
    mae = mean_absolute_error(target, pred)           # To compute MAE
    mape = mape_score(target, pred)                   # To compute MAPE

    # Creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "RMSE": rmse,
            "MAE": mae,
            "R-squared": r2,
            "Adj. R-squared": adjr2,
            "MAPE": mape,
        },
        index=[0],
    )

    return df_perf

Decision Tree¶

In [ ]:
# calculate frequency of target variable
y_train.value_counts()/len(y_train)
Out[ ]:
1    0.54672
0    0.45328
Name: Overall_Experience, dtype: float64
In [ ]:
# Building decision tree model
dt = DecisionTreeClassifier(class_weight = {0: 0.54672, 1: 0.45328}, random_state = 1)
# Fitting decision tree model
dt.fit(x_train, y_train)
# Checking performance on the training dataset
y_train_pred_dt = dt.predict(x_train)
# Checking performance on the test dataset
y_test_pred_dt = dt.predict(x_test)
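
The hard-coded `class_weight` gives each class the observed frequency of the other class, which up-weights the less frequent class 0. As a quick sanity check, the snippet below compares the ratio of these weights with the ratio produced by the usual "balanced" heuristic, n_samples / (n_classes * class_count), which is what scikit-learn documents for `class_weight='balanced'`. The variable names are illustrative.

```python
# Class frequencies in y_train, from the value_counts output above
freq = {0: 0.45328, 1: 0.54672}
n_classes = len(freq)

# Weights used in the notebook: each class gets the other class's frequency
notebook_weights = {0: freq[1], 1: freq[0]}

# "Balanced" heuristic expressed with frequencies: 1 / (n_classes * frequency)
balanced_weights = {c: 1.0 / (n_classes * f) for c, f in freq.items()}

ratio_notebook = notebook_weights[0] / notebook_weights[1]
ratio_balanced = balanced_weights[0] / balanced_weights[1]
print(ratio_notebook, ratio_balanced)
```

Both ratios come out at about 1.21, so the two weightings differ only by a constant factor. Passing `class_weight='balanced'` should therefore express the same relative emphasis without hard-coding the frequencies.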
In [ ]:
#evaluation on train set
metrics_score(y_train, y_train_pred_dt)
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     34224
           1       1.00      1.00      1.00     41279

    accuracy                           1.00     75503
   macro avg       1.00      1.00      1.00     75503
weighted avg       1.00      1.00      1.00     75503

In [ ]:
#evaluation on test set
metrics_score(y_test, y_test_pred_dt)
              precision    recall  f1-score   support

           0       0.92      0.92      0.92      8562
           1       0.93      0.94      0.93     10314

    accuracy                           0.93     18876
   macro avg       0.93      0.93      0.93     18876
weighted avg       0.93      0.93      0.93     18876

Observations:

  • The decision tree is clearly overfitting: accuracy is 1.00 on the train set and 0.93 on the test set.
  • We can definitely do better, but 0.93 on the test set is a good starting point.
  • For the satisfied class (1), precision is 0.93 and recall (sensitivity) is 0.94 on the test set.
In [ ]:
# Plotting the feature's importance

#Extracting the importance from the decision tree
importances = dt.feature_importances_

#Extracting the features/independent variables
columns = x.columns

#Putting the feature's importance into a dataframe
importance_df = pd.DataFrame(importances,
                             index = columns,
                             columns = ['Importance']).sort_values(by = 'Importance', ascending = False)

#Setting the plot's size
plt.figure(figsize = (13, 13))

#Plot the barplot for feature importance
sns.barplot(x=importance_df.Importance,y=importance_df.index)
Out[ ]:
<Axes: xlabel='Importance', ylabel='None'>

Observations:

  • Onboard entertainment is by far the most important explanatory variable in the dataset.
  • Seat comfort and ease of online booking are also very useful for predicting the overall experience.
  • Customer type, travel distance, platform location and age also add information to the model.
  • The rest of the variables add only a little information to the model.
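
The ranking above comes from impurity-based importances, which are computed on the training data and can favour features with many distinct values. A complementary view, sketched below on synthetic data, is permutation importance measured on held-out data with scikit-learn's `permutation_importance`. The synthetic features and the shallow tree are assumptions for the example, not the notebook's model.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: only the first of three features carries signal
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 3))
y = (X[:, 0] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature in turn on the held-out set and record the drop in accuracy
result = permutation_importance(tree, X_te, y_te, n_repeats=5, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]
print(ranking, result.importances_mean)
```

With the notebook's objects, the analogous call would be `permutation_importance(dt, x_test, y_test, n_repeats=5, random_state=1)`. If the two rankings agree, that adds confidence to the observations above.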

Using Decision Tree as the baseline score, we developed more advanced models to see if we can reach a higher score.

Random Forest¶

In [ ]:
# Fitting the Random Forest classifier on the training data
rf = RandomForestClassifier(class_weight = {0: 0.54672, 1: 0.45328}, random_state = 1)
rf.fit(x_train, y_train)
#predict train sample
y_pred_train_rf = rf.predict(x_train)
#predict test sample
y_pred_test_rf = rf.predict(x_test)
In [ ]:
metrics_score(y_train, y_pred_train_rf)
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     34224
           1       1.00      1.00      1.00     41279

    accuracy                           1.00     75503
   macro avg       1.00      1.00      1.00     75503
weighted avg       1.00      1.00      1.00     75503

In [ ]:
metrics_score(y_test, y_pred_test_rf)
              precision    recall  f1-score   support

           0       0.94      0.95      0.94      8562
           1       0.96      0.95      0.95     10314

    accuracy                           0.95     18876
   macro avg       0.95      0.95      0.95     18876
weighted avg       0.95      0.95      0.95     18876

Observations:

  • The model is overfitting: accuracy is 1.00 on the train set and 0.95 on the test set.
  • Accuracy improved compared with the Decision Tree.
In [ ]:
# Plot the feature importance

#Extract importance
importances = rf.feature_importances_

#Extract features
columns = x.columns

#Put data into a dataframe
importance_df = pd.DataFrame(importances, index = columns, columns = ['Importance']).sort_values(by = 'Importance', ascending = False)

#Determine the figure size for plot
plt.figure(figsize = (13, 13))

#Plot a bar plot
sns.barplot(x=importance_df.Importance,y=importance_df.index)
Out[ ]:
<Axes: xlabel='Importance', ylabel='None'>

Observations:

  • Onboard entertainment is the most important variable.
  • Seat comfort is also a very important feature.
  • Ease of online booking and online support also matter considerably to passengers.
  • Legroom, travel distance, catering, customer type, onboard service and travel class add information to the model.

RF Pruning¶

In [ ]:
# try different forests, varying the max_depth parameter, to find the model that maximizes precision
depths = [6,7,8,9,10,11,12,13,14,15]
precisions_test = []
recalls_test = []
precisions_train = []
recalls_train = []

for depth in depths:
    #Creating the Random Forest Classifier
    rf = RandomForestClassifier(class_weight = {0: 0.54672, 1: 0.45328},
                                random_state = 1,
                                max_depth = depth) #Here we are testing the different depth values

    #Fit the model onto our training data
    rf.fit(x_train, y_train)

    #Making the predictions
    y_test_pred_rf = rf.predict(x_test)
    y_train_pred_rf = rf.predict(x_train)

    #Recording the prediction results to get the precision and recall metrics
    precisions_test.append(precision_score(y_test, y_test_pred_rf))
    recalls_test.append(recall_score(y_test, y_test_pred_rf))
    precisions_train.append(precision_score(y_train, y_train_pred_rf))
    recalls_train.append(recall_score(y_train, y_train_pred_rf))

#Creating a dataframe to store the precision and recall values at each depth
pruning_rf = pd.DataFrame()
pruning_rf['depth']=depths
pruning_rf['precision_test'] = precisions_test
pruning_rf['precision_train'] = precisions_train
pruning_rf['recall_test'] = recalls_test
pruning_rf['recall_train'] = recalls_train

pruning_rf
Out[ ]:
   depth  precision_test  precision_train  recall_test  recall_train
0      6        0.907355         0.910934     0.907795      0.911287
1      7        0.915613         0.921630     0.913128      0.916495
2      8        0.922306         0.926203     0.918460      0.923084
3      9        0.927616         0.934801     0.920690      0.928777
4     10        0.938058         0.947071     0.926508      0.933695
5     11        0.940766         0.953004     0.928544      0.940260
6     12        0.947861         0.963969     0.932422      0.945614
7     13        0.952146         0.971956     0.935621      0.952954
8     14        0.955839         0.979099     0.938045      0.961191
9     15        0.959109         0.985690     0.939209      0.967853
In [ ]:
#Plotting the above data to visualize which depth has the best precision
plt.figure(figsize=(10,6))
plt.plot(pruning_rf['depth'], pruning_rf['precision_test'], label='Precision Test', marker='o')
plt.plot(pruning_rf['depth'], pruning_rf['recall_test'], label='Recall Test', marker='o')
plt.plot(pruning_rf['depth'], pruning_rf['precision_train'], label='Precision train', marker='o')
plt.plot(pruning_rf['depth'], pruning_rf['recall_train'], label='Recall train', marker='o')

#Determining plot appearance
plt.xlabel('Depth')
plt.ylabel('Value')
plt.title('Random forest: Precision and Recall Test vs Depth')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
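The sweep above scores every depth on the test set, which is also the final hold-out. As a hedged alternative, here is a minimal sketch of the same kind of sweep using cross-validation on training data only. It assumes scikit-learn's `validation_curve` and uses a small synthetic dataset from `make_classification` as a stand-in for `x_train` and `y_train`, so the numbers it prints are not results for the Shinkansen data. The names `X_demo`, `y_demo`, and `depth_range` are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

# Synthetic stand-in for x_train / y_train (assumption: not the notebook's data)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=1)

depth_range = [2, 4, 8]  # illustrative depths only

# For each depth, fit on the training folds and score precision on the held-out fold
train_scores, cv_scores = validation_curve(
    RandomForestClassifier(n_estimators=20, random_state=1),
    X_demo, y_demo,
    param_name="max_depth",
    param_range=depth_range,
    cv=3,
    scoring="precision",
)

# Average across folds: one mean score per depth
mean_train = train_scores.mean(axis=1)
mean_cv = cv_scores.mean(axis=1)
for d, tr, cv in zip(depth_range, mean_train, mean_cv):
    print(f"depth={d}: train precision={tr:.3f}, cv precision={cv:.3f}")
```

Choosing the depth from the cross-validated column leaves the test set untouched for the final comparison between models.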

Since all metrics are still increasing at depth=15, the largest depth tested, let us build a random forest with a depth of 15 and see if the RF will perform better.

In [ ]:
# Fitting the Random Forest classifier on the training data
rf2 = RandomForestClassifier(class_weight = {0: 0.54672, 1: 0.45328}, random_state = 1, max_depth =15)
rf2.fit(x_train, y_train)
#predict train sample
y_pred_train_rf2 = rf2.predict(x_train)
#predict test sample
y_pred_test_rf2 = rf2.predict(x_test)
In [ ]:
#Testing against training data
metrics_score(y_train, y_pred_train_rf2)
              precision    recall  f1-score   support

           0       0.96      0.98      0.97     34224
           1       0.99      0.97      0.98     41279

    accuracy                           0.97     75503
   macro avg       0.97      0.98      0.97     75503
weighted avg       0.97      0.97      0.97     75503

In [ ]:
#Testing against testing data
metrics_score(y_test, y_pred_test_rf2)
              precision    recall  f1-score   support

           0       0.93      0.95      0.94      8562
           1       0.96      0.94      0.95     10314

    accuracy                           0.94     18876
   macro avg       0.94      0.95      0.94     18876
weighted avg       0.95      0.94      0.94     18876

The accuracy of this model on the test data (0.94) is slightly lower than that of the first RF model (0.95), so we will test larger depths to see if we can achieve a better performance.

In [ ]:
# try different trees changing the max_depth parameter to find the model which maximise accuracy
depths = [12,13,14,15,16,17,18,19,20]
accuracy_test = []
accuracy_train = []

for depth in depths:
    #Creating the Random Forest Classifier
    rf = RandomForestClassifier(class_weight = {0: 0.54672, 1: 0.45328},
                                random_state = 1,
                                max_depth = depth) #Using a higher range of depth, since the metrics improves with depths higher than 15

    #Fit the model onto our training data
    rf.fit(x_train, y_train)

    #Making the predictions
    y_test_pred_rf = rf.predict(x_test)
    y_train_pred_rf = rf.predict(x_train)

    #Recording the prediction results to get the accuracy metrics
    accuracy_test.append(accuracy_score(y_test, y_test_pred_rf))
    accuracy_train.append(accuracy_score(y_train, y_train_pred_rf))

#Creating a dataframe to store the accuracy values at each depth
pruning_rf = pd.DataFrame()
pruning_rf['depth']=depths
pruning_rf['accuracy_test'] = accuracy_test
pruning_rf['accuracy_train'] = accuracy_train

pruning_rf
Out[ ]:
   depth  accuracy_test  accuracy_train
0     12       0.935050        0.950942
1     13       0.939129        0.959247
2     14       0.942467        0.967564
3     15       0.944904        0.974743
4     16       0.946652        0.981696
5     17       0.947764        0.986159
6     18       0.948559        0.990133
7     19       0.949089        0.993020
8     20       0.948135        0.995563
In [ ]:
#Plotting the accuracy of model against depth
plt.figure(figsize=(10,6))
plt.plot(pruning_rf['depth'], pruning_rf['accuracy_test'], label='accuracy_test', marker='o')
plt.plot(pruning_rf['depth'], pruning_rf['accuracy_train'], label='accuracy_train', marker='o')

plt.xlabel('Depth')
plt.ylabel('Value')
plt.title('Random forest: Accuracy vs Depth')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()

As the table and plot show, accuracy on the test data peaks at depth=19 (0.9491) and dips at depth=20, while accuracy on the training data keeps rising.

So we will create an RF model using depth = 19.
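The choice of depth can also be read off the table programmatically. The sketch below rebuilds the table from the values printed above and picks the row with the highest test accuracy. The `gap` column is an addition for illustration and is not part of the original notebook.

```python
import pandas as pd

# Test and train accuracy per depth, copied from the table above
pruning_rf = pd.DataFrame({
    "depth": [12, 13, 14, 15, 16, 17, 18, 19, 20],
    "accuracy_test": [0.935050, 0.939129, 0.942467, 0.944904, 0.946652,
                      0.947764, 0.948559, 0.949089, 0.948135],
    "accuracy_train": [0.950942, 0.959247, 0.967564, 0.974743, 0.981696,
                       0.986159, 0.990133, 0.993020, 0.995563],
})

# Row with the highest test accuracy
best_row = pruning_rf.loc[pruning_rf["accuracy_test"].idxmax()]
best_depth = int(best_row["depth"])

# Gap between train and test accuracy, a rough indicator of overfitting
pruning_rf["gap"] = pruning_rf["accuracy_train"] - pruning_rf["accuracy_test"]

print(best_depth)                                 # 19
print(pruning_rf["gap"].is_monotonic_increasing)  # True
```

The gap widens at every step, so the extra depth buys less than 0.5 percentage points of test accuracy between depth 15 and depth 19 while the model fits the training data ever more closely.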

In [ ]:
# Fitting the Random Forest classifier on the training data
rf3 = RandomForestClassifier(class_weight = {0: 0.54672, 1: 0.45328}, random_state = 1, max_depth =19)
rf3.fit(x_train, y_train)
#predict train sample
y_pred_train_rf3 = rf3.predict(x_train)
#predict test sample
y_pred_test_rf3 = rf3.predict(x_test)
In [ ]:
metrics_score(y_train, y_pred_train_rf3)
              precision    recall  f1-score   support

           0       0.99      1.00      0.99     34224
           1       1.00      0.99      0.99     41279

    accuracy                           0.99     75503
   macro avg       0.99      0.99      0.99     75503
weighted avg       0.99      0.99      0.99     75503

In [ ]:
metrics_score(y_test, y_pred_test_rf3)
              precision    recall  f1-score   support

           0       0.94      0.95      0.94      8562
           1       0.96      0.95      0.95     10314

    accuracy                           0.95     18876
   macro avg       0.95      0.95      0.95     18876
weighted avg       0.95      0.95      0.95     18876

Observations:

  • This RF model performs a little better than the base model.
  • The model is still overfitting: accuracy is 0.99 on the training data but 0.95 on the testing data.
  • The model could be improved further with finer tuning.

With a constraint on time, we moved on to developing deep learning models.

Proposed further pruning of RF Model¶

In this section, the parameter grid was too large for Google Colab to process: the search ran for more than a day.

Therefore, I suggest using these parameters as a guide and reducing the number of parameters when tuning your RF model.

Let me know your optimal parameters if you try this out! :D
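To see why the search proposed in this section is so slow, it helps to count the model fits it requests. The sketch below multiplies the number of candidate values per parameter by the number of cross-validation folds. It counts five `max_features` options (0.3, 0.5, 0.7, 0.9, 'sqrt'), because the legacy 'auto' value, where present, was an alias of 'sqrt' for classifiers. The dictionary `grid_sizes` is a hypothetical helper for the arithmetic, not part of the notebook.

```python
from math import prod

# Number of candidate values per hyperparameter in the proposed grid
grid_sizes = {
    "criterion": 3,          # entropy, gini, log_loss
    "n_estimators": 3,       # 100, 250, 500
    "min_samples_leaf": 4,   # 1..4
    "max_features": 5,       # 0.3, 0.5, 0.7, 0.9, sqrt
    "max_depth": 6,          # 16..21
    "min_samples_split": 6,  # 2..7
    "bootstrap": 2,          # True, False
    "class_weight": 2,       # None, balanced
}
cv_folds = 5

n_combinations = prod(grid_sizes.values())
n_fits = n_combinations * cv_folds

print(n_combinations)  # 25920
print(n_fits)          # 129600
```

Each of those fits trains a forest of 100 to 500 trees on the training data, which is why trimming the grid matters. Fixing a few parameters, or sampling a fixed number of combinations (for example with scikit-learn's `RandomizedSearchCV` and its `n_iter` argument), keeps the budget bounded.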

In [ ]:
'''
# Choosing the Classifier as the estimator
rf_estimator_tuned = RandomForestClassifier(random_state = 1)

# List of parameters that will be used for the gridsearch
params_rf = {'criterion': ['entropy','gini','log_loss'],
             "n_estimators": [100, 250, 500], #No. of trees
             "min_samples_leaf": np.arange(1,5,1), #Min. samples in each leaf
             "max_features": [0.3,0.5,0.7, 0.9, 'sqrt'], #'auto' dropped: for classifiers it was an alias of 'sqrt' and was removed in scikit-learn 1.3
             "max_depth":[16,17,18,19,20,21], #Max. depth of each tree
             "min_samples_split": [2,3,4,5,6,7], #Min. samples needed to split a node
             "bootstrap":[True,False],
             "class_weight":[None,'balanced']
}
# Using precision score for class 1
scorer = metrics.make_scorer(precision_score, pos_label = 1)

# Run the grid search function
grid_obj = GridSearchCV(estimator = rf_estimator_tuned,
                        param_grid = params_rf,
                        scoring = scorer,
                        cv = 5)

# Fit the gridsearch onto the training data
grid_obj = grid_obj.fit(x_train_scaled, y_train)

best_params = grid_obj.best_params_
best_score = grid_obj.best_score_
print("Best Parameters:", best_params)
print("Best Score:", best_score)
'''
In [ ]:
'''
# Set the classifier to the best combination of parameters
rf_estimator_tuned = grid_obj.best_estimator_

# Fit the best estimator to the data
rf_estimator_tuned.fit(x_train, y_train)

#predict train sample
y_pred_train_rf_tuned = rf_estimator_tuned.predict(x_train)
#predict test sample
y_pred_test_rf_tuned = rf_estimator_tuned.predict(x_test)

metrics_score(y_train, y_pred_train_rf_tuned)
'''
In [ ]:
'''
metrics_score(y_test, y_pred_test_rf_tuned)
'''

Deep Learning¶

Model 1 - Base Model¶

In [ ]:
# Fixing the seed for random number generators
np.random.seed(42)

import random
random.seed(42)

tf.random.set_seed(42)
In [ ]:
# Initialize sequential model
model_1 = tf.keras.Sequential([tf.keras.layers.Flatten(input_shape=(23,)),#Input layer
                               tf.keras.layers.Dense(128, activation='relu'), #Hidden layer
                               tf.keras.layers.Dense(64, activation='relu'), #Hidden layer
                               tf.keras.layers.Dense(1, activation='sigmoid')]) #Output layer, only 1 node because we predict a single binary target
In [ ]:
#Compile the sequential model defined above with the following loss, optimizer, and metric
model_1.compile(loss = 'binary_crossentropy',
                optimizer='adamax',
                metrics=['accuracy'])

#Show the model summary
model_1.summary()
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 flatten (Flatten)           (None, 23)                0         
                                                                 
 dense (Dense)               (None, 128)               3072      
                                                                 
 dense_1 (Dense)             (None, 64)                8256      
                                                                 
 dense_2 (Dense)             (None, 1)                 65        
                                                                 
=================================================================
Total params: 11,393
Trainable params: 11,393
Non-trainable params: 0
_________________________________________________________________
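The parameter counts in the summary can be checked by hand: a Dense layer has one weight per input-unit pair plus one bias per unit. The sketch below reproduces the numbers for Model 1. `dense_params` is a hypothetical helper written for this check, not a Keras function.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units

layer_sizes = [23, 128, 64, 1]  # input features, two hidden layers, output node

per_layer = [dense_params(n_in, n_out)
             for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

print(per_layer)       # [3072, 8256, 65]
print(sum(per_layer))  # 11393
```

The Flatten layer contributes no parameters, so the total matches the 11,393 trainable parameters reported by `model_1.summary()`.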
In [ ]:
# Let us now fit the model onto our data
history1 = model_1.fit(x_train_scaled,
                          y_train,
                          validation_split=0.2, #20% for validation data
                          verbose=1, #Show a progress bar for each epoch (0 = silent, 2 = one line per epoch)
                          epochs=50, #Number of times the model goes through the entire training dataset
                          batch_size=32) #Mini-batch gradient descent with 32 samples per training step
Epoch 1/50
1888/1888 [==============================] - 6s 3ms/step - loss: 0.2765 - accuracy: 0.8852 - val_loss: 0.2222 - val_accuracy: 0.9089
Epoch 2/50
1888/1888 [==============================] - 6s 3ms/step - loss: 0.2085 - accuracy: 0.9122 - val_loss: 0.1972 - val_accuracy: 0.9163
Epoch 3/50
1888/1888 [==============================] - 6s 3ms/step - loss: 0.1890 - accuracy: 0.9208 - val_loss: 0.1843 - val_accuracy: 0.9238
Epoch 4/50
1888/1888 [==============================] - 6s 3ms/step - loss: 0.1760 - accuracy: 0.9259 - val_loss: 0.1759 - val_accuracy: 0.9283
Epoch 5/50
1888/1888 [==============================] - 4s 2ms/step - loss: 0.1663 - accuracy: 0.9306 - val_loss: 0.1683 - val_accuracy: 0.9286
Epoch 6/50
1888/1888 [==============================] - 4s 2ms/step - loss: 0.1586 - accuracy: 0.9336 - val_loss: 0.1624 - val_accuracy: 0.9300
Epoch 7/50
1888/1888 [==============================] - 6s 3ms/step - loss: 0.1512 - accuracy: 0.9373 - val_loss: 0.1560 - val_accuracy: 0.9352
Epoch 8/50
1888/1888 [==============================] - 4s 2ms/step - loss: 0.1456 - accuracy: 0.9392 - val_loss: 0.1507 - val_accuracy: 0.9346
Epoch 9/50
1888/1888 [==============================] - 4s 2ms/step - loss: 0.1408 - accuracy: 0.9412 - val_loss: 0.1483 - val_accuracy: 0.9357
Epoch 10/50
1888/1888 [==============================] - 6s 3ms/step - loss: 0.1362 - accuracy: 0.9432 - val_loss: 0.1450 - val_accuracy: 0.9378
Epoch 11/50
1888/1888 [==============================] - 4s 2ms/step - loss: 0.1326 - accuracy: 0.9437 - val_loss: 0.1451 - val_accuracy: 0.9391
Epoch 12/50
1888/1888 [==============================] - 5s 3ms/step - loss: 0.1290 - accuracy: 0.9447 - val_loss: 0.1417 - val_accuracy: 0.9393
Epoch 13/50
1888/1888 [==============================] - 5s 3ms/step - loss: 0.1262 - accuracy: 0.9471 - val_loss: 0.1377 - val_accuracy: 0.9408
Epoch 14/50
1888/1888 [==============================] - 4s 2ms/step - loss: 0.1232 - accuracy: 0.9486 - val_loss: 0.1381 - val_accuracy: 0.9433
Epoch 15/50
1888/1888 [==============================] - 6s 3ms/step - loss: 0.1205 - accuracy: 0.9499 - val_loss: 0.1414 - val_accuracy: 0.9391
Epoch 16/50
1888/1888 [==============================] - 4s 2ms/step - loss: 0.1184 - accuracy: 0.9504 - val_loss: 0.1346 - val_accuracy: 0.9427
Epoch 17/50
1888/1888 [==============================] - 4s 2ms/step - loss: 0.1162 - accuracy: 0.9523 - val_loss: 0.1331 - val_accuracy: 0.9445
Epoch 18/50
1888/1888 [==============================] - 6s 3ms/step - loss: 0.1139 - accuracy: 0.9525 - val_loss: 0.1330 - val_accuracy: 0.9450
Epoch 19/50
1888/1888 [==============================] - 4s 2ms/step - loss: 0.1124 - accuracy: 0.9536 - val_loss: 0.1324 - val_accuracy: 0.9421
Epoch 20/50
1888/1888 [==============================] - 4s 2ms/step - loss: 0.1105 - accuracy: 0.9536 - val_loss: 0.1322 - val_accuracy: 0.9432
Epoch 21/50
1888/1888 [==============================] - 6s 3ms/step - loss: 0.1089 - accuracy: 0.9541 - val_loss: 0.1338 - val_accuracy: 0.9423
Epoch 22/50
1888/1888 [==============================] - 4s 2ms/step - loss: 0.1069 - accuracy: 0.9554 - val_loss: 0.1296 - val_accuracy: 0.9431
Epoch 23/50
1888/1888 [==============================] - 4s 2ms/step - loss: 0.1062 - accuracy: 0.9549 - val_loss: 0.1322 - val_accuracy: 0.9442
Epoch 24/50
1888/1888 [==============================] - 6s 3ms/step - loss: 0.1043 - accuracy: 0.9566 - val_loss: 0.1283 - val_accuracy: 0.9447
Epoch 25/50
1888/1888 [==============================] - 4s 2ms/step - loss: 0.1031 - accuracy: 0.9572 - val_loss: 0.1273 - val_accuracy: 0.9471
Epoch 26/50
1888/1888 [==============================] - 5s 3ms/step - loss: 0.1014 - accuracy: 0.9577 - val_loss: 0.1292 - val_accuracy: 0.9464
Epoch 27/50
1888/1888 [==============================] - 5s 3ms/step - loss: 0.1008 - accuracy: 0.9574 - val_loss: 0.1280 - val_accuracy: 0.9456
Epoch 28/50
1888/1888 [==============================] - 4s 2ms/step - loss: 0.0991 - accuracy: 0.9588 - val_loss: 0.1285 - val_accuracy: 0.9422
Epoch 29/50
1888/1888 [==============================] - 6s 3ms/step - loss: 0.0983 - accuracy: 0.9592 - val_loss: 0.1264 - val_accuracy: 0.9471
Epoch 30/50
1888/1888 [==============================] - 4s 2ms/step - loss: 0.0972 - accuracy: 0.9594 - val_loss: 0.1290 - val_accuracy: 0.9460
Epoch 31/50
1888/1888 [==============================] - 4s 2ms/step - loss: 0.0963 - accuracy: 0.9594 - val_loss: 0.1319 - val_accuracy: 0.9438
Epoch 32/50
1888/1888 [==============================] - 6s 3ms/step - loss: 0.0948 - accuracy: 0.9600 - val_loss: 0.1275 - val_accuracy: 0.9461
Epoch 33/50
1888/1888 [==============================] - 4s 2ms/step - loss: 0.0943 - accuracy: 0.9602 - val_loss: 0.1267 - val_accuracy: 0.9481
Epoch 34/50
1888/1888 [==============================] - 4s 2ms/step - loss: 0.0929 - accuracy: 0.9616 - val_loss: 0.1250 - val_accuracy: 0.9468
Epoch 35/50
1888/1888 [==============================] - 6s 3ms/step - loss: 0.0922 - accuracy: 0.9613 - val_loss: 0.1276 - val_accuracy: 0.9456
Epoch 36/50
1888/1888 [==============================] - 4s 2ms/step - loss: 0.0913 - accuracy: 0.9619 - val_loss: 0.1268 - val_accuracy: 0.9458
Epoch 37/50
1888/1888 [==============================] - 4s 2ms/step - loss: 0.0902 - accuracy: 0.9623 - val_loss: 0.1281 - val_accuracy: 0.9465
Epoch 38/50
1888/1888 [==============================] - 6s 3ms/step - loss: 0.0892 - accuracy: 0.9628 - val_loss: 0.1289 - val_accuracy: 0.9434
Epoch 39/50
1888/1888 [==============================] - 4s 2ms/step - loss: 0.0886 - accuracy: 0.9633 - val_loss: 0.1264 - val_accuracy: 0.9470
Epoch 40/50
1888/1888 [==============================] - 5s 3ms/step - loss: 0.0874 - accuracy: 0.9632 - val_loss: 0.1261 - val_accuracy: 0.9462
Epoch 41/50
1888/1888 [==============================] - 5s 3ms/step - loss: 0.0869 - accuracy: 0.9635 - val_loss: 0.1293 - val_accuracy: 0.9432
Epoch 42/50
1888/1888 [==============================] - 4s 2ms/step - loss: 0.0868 - accuracy: 0.9634 - val_loss: 0.1275 - val_accuracy: 0.9480
Epoch 43/50
1888/1888 [==============================] - 5s 3ms/step - loss: 0.0858 - accuracy: 0.9638 - val_loss: 0.1311 - val_accuracy: 0.9448
Epoch 44/50
1888/1888 [==============================] - 5s 3ms/step - loss: 0.0849 - accuracy: 0.9648 - val_loss: 0.1282 - val_accuracy: 0.9460
Epoch 45/50
1888/1888 [==============================] - 4s 2ms/step - loss: 0.0843 - accuracy: 0.9646 - val_loss: 0.1274 - val_accuracy: 0.9477
Epoch 46/50
1888/1888 [==============================] - 5s 3ms/step - loss: 0.0834 - accuracy: 0.9649 - val_loss: 0.1301 - val_accuracy: 0.9462
Epoch 47/50
1888/1888 [==============================] - 5s 3ms/step - loss: 0.0828 - accuracy: 0.9654 - val_loss: 0.1327 - val_accuracy: 0.9458
Epoch 48/50
1888/1888 [==============================] - 4s 2ms/step - loss: 0.0817 - accuracy: 0.9663 - val_loss: 0.1311 - val_accuracy: 0.9439
Epoch 49/50
1888/1888 [==============================] - 7s 4ms/step - loss: 0.0812 - accuracy: 0.9668 - val_loss: 0.1300 - val_accuracy: 0.9453
Epoch 50/50
1888/1888 [==============================] - 6s 3ms/step - loss: 0.0806 - accuracy: 0.9665 - val_loss: 0.1290 - val_accuracy: 0.9464
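The `1888/1888` shown at every epoch follows from the training set size, the validation split, and the batch size. The sketch below reproduces it, assuming the 75,503 training rows shown as the support total in the training reports earlier in this notebook.

```python
import math

n_samples = 75503        # rows in the training data (support total in the reports)
validation_split = 0.2
batch_size = 32

# Keras holds out the last fraction of the rows for validation
n_train = math.floor(n_samples * (1 - validation_split))
steps_per_epoch = math.ceil(n_train / batch_size)

print(n_train)          # 60402
print(steps_per_epoch)  # 1888
```

With a batch size of 64 the same calculation would give 944 steps, so the log confirms that the model was trained with 32 samples per step.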
In [ ]:
#Plotting Train Loss vs Validation Loss
plt.plot(history1.history['loss'])
plt.plot(history1.history['val_loss'])
plt.title('model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()
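In the training log above, the validation loss stops improving well before epoch 50 while the training loss keeps falling. One common response is patience-based early stopping. The sketch below replays the logged validation losses through that rule in plain Python. `early_stop_epoch` is a hypothetical helper, and a patience of 5 is an assumption chosen for illustration.

```python
# Validation loss per epoch, copied from the Model 1 training log above
val_loss = [
    0.2222, 0.1972, 0.1843, 0.1759, 0.1683, 0.1624, 0.1560, 0.1507, 0.1483, 0.1450,
    0.1451, 0.1417, 0.1377, 0.1381, 0.1414, 0.1346, 0.1331, 0.1330, 0.1324, 0.1322,
    0.1338, 0.1296, 0.1322, 0.1283, 0.1273, 0.1292, 0.1280, 0.1285, 0.1264, 0.1290,
    0.1319, 0.1275, 0.1267, 0.1250, 0.1276, 0.1268, 0.1281, 0.1289, 0.1264, 0.1261,
    0.1293, 0.1275, 0.1311, 0.1282, 0.1274, 0.1301, 0.1327, 0.1311, 0.1300, 0.1290,
]

def early_stop_epoch(losses, patience):
    """Return (best_epoch, stop_epoch), 1-indexed, for patience-based early stopping."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best_loss:
            # New best validation loss: remember it and reset the counter
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return best_epoch, epoch
    return best_epoch, len(losses)

best_epoch, stop_epoch = early_stop_epoch(val_loss, patience=5)
print(best_epoch, stop_epoch)  # 34 39
```

In Keras the same rule is available as `tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)`, passed to `fit` through the `callbacks` argument. With these settings training would have ended after epoch 39 and kept the weights from epoch 34.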
In [ ]:
#Plotting Epoch vs accuracy
plt.plot(history1.history['accuracy'])
plt.plot(history1.history['val_accuracy'])

plt.title('Accuracy vs Epochs')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='lower right')
plt.show()
In [ ]:
y_pred=model_1.predict(x_test_scaled)
y_pred = (y_pred > 0.5)
y_pred
590/590 [==============================] - 1s 1ms/step
Out[ ]:
array([[ True],
       [ True],
       [ True],
       ...,
       [False],
       [ True],
       [False]])
In [ ]:
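#NOTE: this call compares y_train with itself, so the report below is 1.00 by construction.
#To score the model on the training data, pass its predictions instead, e.g.
#metrics_score(y_train, model_1.predict(x_train_scaled) > 0.5)
metrics_score(y_train,y_train)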
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     34224
           1       1.00      1.00      1.00     41279

    accuracy                           1.00     75503
   macro avg       1.00      1.00      1.00     75503
weighted avg       1.00      1.00      1.00     75503

In [ ]:
metrics_score(y_test,y_pred)
              precision    recall  f1-score   support

           0       0.93      0.95      0.94      8562
           1       0.96      0.94      0.95     10314

    accuracy                           0.95     18876
   macro avg       0.94      0.95      0.95     18876
weighted avg       0.95      0.95      0.95     18876

Observations:

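  • The training report above is not evidence of fit: it compares y_train with itself, so it is 1.00 by construction. The training log gives a fairer picture: at epoch 50, accuracy is 0.9665 on the training split and 0.9464 on the validation split, while accuracy on the testing data is 0.95. The base model therefore overfits only mildly, and its test performance is quite similar to that of the Decision Tree and Random Forest models.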

We can use the ROC-AUC method to further tune this base model by choosing how sensitive or specific the model should be.

In the ROC-AUC method, we change the threshold used to classify a result as positive or negative. The model outputs a probability between 0 and 1, and by default a value above 0.5 is classified as positive. Raising that value to 0.6, for example, makes the model more specific and less sensitive, because a prediction needs a higher score to be labelled positive.

The more specific a model is, the fewer false positives it makes. However, a more specific model will also miss more actual positives, producing more false negatives.

On the other hand, a more sensitive model catches more of the actual positives, at the cost of more false positives.
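The trade-off can be seen on a toy example. The sketch below uses eight hypothetical labels and predicted probabilities (not taken from the notebook's data) and reports sensitivity and specificity at thresholds of 0.5 and 0.6. `sensitivity_specificity` is an illustrative helper.

```python
import numpy as np

# Hypothetical labels and predicted probabilities, for illustration only
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_prob = np.array([0.10, 0.40, 0.55, 0.70, 0.45, 0.58, 0.80, 0.90])

def sensitivity_specificity(y_true, y_prob, threshold):
    """Classify as positive when the probability exceeds the threshold."""
    y_pred = y_prob > threshold
    tp = np.sum(y_pred & (y_true == 1))
    fn = np.sum(~y_pred & (y_true == 1))
    tn = np.sum(~y_pred & (y_true == 0))
    fp = np.sum(y_pred & (y_true == 0))
    return tp / (tp + fn), tn / (tn + fp)

for threshold in (0.5, 0.6):
    sens, spec = sensitivity_specificity(y_true, y_prob, threshold)
    print(f"threshold={threshold}: sensitivity={sens:.2f}, specificity={spec:.2f}")
# threshold=0.5: sensitivity=0.75, specificity=0.50
# threshold=0.6: sensitivity=0.50, specificity=0.75
```

Raising the threshold removes one false positive but also turns one true positive into a false negative, which is the exchange described above.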

Tuning via ROC-AUC

In [ ]:
from sklearn.metrics import roc_curve
from matplotlib import pyplot

# predict probabilities
yhat1 = model_1.predict(x_test_scaled)

# keep probabilities for the positive outcome only
yhat1 = yhat1[:, 0]

# calculate roc curves
fpr, tpr, thresholds1 = roc_curve(y_test, yhat1)

# calculate the g-mean for each threshold
gmeans1 = np.sqrt(tpr * (1-fpr))

# locate the index of the largest g-mean
ix = np.argmax(gmeans1)
print('Best Threshold=%f, G-Mean=%.3f' % (thresholds1[ix], gmeans1[ix]))

# plot the roc curve for the model
pyplot.plot([0,1], [0,1], linestyle='--', label='No Skill')
pyplot.plot(fpr, tpr, marker='.')
pyplot.scatter(fpr[ix], tpr[ix], marker='o', color='black', label='Best')

# axis labels
pyplot.xlabel('False Positive Rate')
pyplot.ylabel('True Positive Rate')
pyplot.legend()

# show the plot
pyplot.show()
590/590 [==============================] - 1s 1ms/step
Best Threshold=0.590928, G-Mean=0.946
In [ ]:
#Predicting test data using best threshold

#Making the prediction using the test data
y_pred_e1=model_1.predict(x_test_scaled)

#Using the threshold value to convert the predicted data into true or false statements. If the predicted data is higher than threshold, it will be labelled true.
y_pred_e1 = (y_pred_e1 > thresholds1[ix])

#Just to see what the data looks like
y_pred_e1
590/590 [==============================] - 1s 1ms/step
Out[ ]:
array([[ True],
       [ True],
       [ True],
       ...,
       [False],
       [ True],
       [False]])
In [ ]:
metrics_score(y_test, y_pred_e1)
              precision    recall  f1-score   support

           0       0.92      0.96      0.94      8562
           1       0.97      0.93      0.95     10314

    accuracy                           0.95     18876
   macro avg       0.94      0.95      0.94     18876
weighted avg       0.95      0.95      0.95     18876

Observations:

  • The model performs almost the same as before the ROC-AUC tuning (accuracy stays at 0.95), so the default threshold of 0.5 is already adequate.

Next, we will try to improve the model by increasing the number of hidden layers and nodes.

Model 2 - Increasing number of hidden layers and nodes¶

Differences from Model 1:

  • We added 2 more hidden layers, resulting in more trainable parameters.
  • We are now using the Adam optimizer instead of Adamax.
In [ ]:
#Part of best practice to clear the backend session before moving onto the next model, this will free up resources
backend.clear_session()

#Fixing the random elements again
np.random.seed(42)
import random
random.seed(42)
tf.random.set_seed(42)
In [ ]:
model_2 = Sequential()

#Adding the hidden and output layers
model_2.add(Dense(256,activation='relu',kernel_initializer='he_uniform',input_dim = x_train_scaled.shape[1])) #New hidden layer
model_2.add(Dense(128,activation='relu',kernel_initializer='he_uniform'))
model_2.add(Dense(64,activation='relu',kernel_initializer='he_uniform'))
model_2.add(Dense(32,activation='relu',kernel_initializer='he_uniform')) #New hidden layer
model_2.add(Dense(1, activation = 'sigmoid'))

# Here, we use the Adam optimizer
optimizer = tf.keras.optimizers.Adam(0.001)

#Compiling the ANN with Adam optimizer and binary cross entropy loss function
model_2.compile(loss='binary_crossentropy',
                optimizer = optimizer,
                metrics=['accuracy'])
In [ ]:
model_2.summary()
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense (Dense)               (None, 256)               6144      
                                                                 
 dense_1 (Dense)             (None, 128)               32896     
                                                                 
 dense_2 (Dense)             (None, 64)                8256      
                                                                 
 dense_3 (Dense)             (None, 32)                2080      
                                                                 
 dense_4 (Dense)             (None, 1)                 33        
                                                                 
=================================================================
Total params: 49,409
Trainable params: 49,409
Non-trainable params: 0
_________________________________________________________________
In [ ]:
#Fit the model onto our data using the following settings
history_2 = model_2.fit(x_train_scaled,
                        y_train,
                        batch_size=32, #Same batch size as Model 1: 32 samples per training step
                        epochs=50, #Number of times the model goes through the entire training dataset
                        verbose=1,
                        validation_split = 0.2)
Epoch 1/50
1888/1888 [==============================] - 8s 4ms/step - loss: 0.2187 - accuracy: 0.9078 - val_loss: 0.1678 - val_accuracy: 0.9270
Epoch 2/50
1888/1888 [==============================] - 6s 3ms/step - loss: 0.1565 - accuracy: 0.9322 - val_loss: 0.1494 - val_accuracy: 0.9378
Epoch 3/50
1888/1888 [==============================] - 7s 4ms/step - loss: 0.1357 - accuracy: 0.9420 - val_loss: 0.1411 - val_accuracy: 0.9387
Epoch 4/50
1888/1888 [==============================] - 5s 3ms/step - loss: 0.1237 - accuracy: 0.9467 - val_loss: 0.1320 - val_accuracy: 0.9432
Epoch 5/50
1888/1888 [==============================] - 8s 4ms/step - loss: 0.1154 - accuracy: 0.9504 - val_loss: 0.1298 - val_accuracy: 0.9451
Epoch 6/50
1888/1888 [==============================] - 13s 7ms/step - loss: 0.1086 - accuracy: 0.9538 - val_loss: 0.1333 - val_accuracy: 0.9396
Epoch 7/50
1888/1888 [==============================] - 15s 8ms/step - loss: 0.1032 - accuracy: 0.9553 - val_loss: 0.1344 - val_accuracy: 0.9459
Epoch 8/50
1888/1888 [==============================] - 7s 4ms/step - loss: 0.0965 - accuracy: 0.9584 - val_loss: 0.1305 - val_accuracy: 0.9440
Epoch 9/50
1888/1888 [==============================] - 7s 4ms/step - loss: 0.0929 - accuracy: 0.9594 - val_loss: 0.1327 - val_accuracy: 0.9418
Epoch 10/50
1888/1888 [==============================] - 6s 3ms/step - loss: 0.0881 - accuracy: 0.9619 - val_loss: 0.1311 - val_accuracy: 0.9465
Epoch 11/50
1888/1888 [==============================] - 8s 4ms/step - loss: 0.0837 - accuracy: 0.9643 - val_loss: 0.1434 - val_accuracy: 0.9460
Epoch 12/50
1888/1888 [==============================] - 5s 3ms/step - loss: 0.0789 - accuracy: 0.9653 - val_loss: 0.1489 - val_accuracy: 0.9441
Epoch 13/50
1888/1888 [==============================] - 7s 4ms/step - loss: 0.0768 - accuracy: 0.9669 - val_loss: 0.1362 - val_accuracy: 0.9459
Epoch 14/50
1888/1888 [==============================] - 5s 3ms/step - loss: 0.0732 - accuracy: 0.9675 - val_loss: 0.1425 - val_accuracy: 0.9458
Epoch 15/50
1888/1888 [==============================] - 7s 4ms/step - loss: 0.0698 - accuracy: 0.9694 - val_loss: 0.1548 - val_accuracy: 0.9424
Epoch 16/50
1888/1888 [==============================] - 6s 3ms/step - loss: 0.0654 - accuracy: 0.9718 - val_loss: 0.1491 - val_accuracy: 0.9462
Epoch 17/50
1888/1888 [==============================] - 6s 3ms/step - loss: 0.0622 - accuracy: 0.9726 - val_loss: 0.1660 - val_accuracy: 0.9448
Epoch 18/50
1888/1888 [==============================] - 6s 3ms/step - loss: 0.0624 - accuracy: 0.9731 - val_loss: 0.1654 - val_accuracy: 0.9462
Epoch 19/50
1888/1888 [==============================] - 5s 3ms/step - loss: 0.0572 - accuracy: 0.9751 - val_loss: 0.1804 - val_accuracy: 0.9428
Epoch 20/50
1888/1888 [==============================] - 7s 4ms/step - loss: 0.0559 - accuracy: 0.9756 - val_loss: 0.1758 - val_accuracy: 0.9452
Epoch 21/50
1888/1888 [==============================] - 6s 3ms/step - loss: 0.0531 - accuracy: 0.9766 - val_loss: 0.1871 - val_accuracy: 0.9413
Epoch 22/50
1888/1888 [==============================] - 7s 4ms/step - loss: 0.0500 - accuracy: 0.9773 - val_loss: 0.1816 - val_accuracy: 0.9436
Epoch 23/50
1888/1888 [==============================] - 6s 3ms/step - loss: 0.0488 - accuracy: 0.9784 - val_loss: 0.1932 - val_accuracy: 0.9430
Epoch 24/50
1888/1888 [==============================] - 7s 4ms/step - loss: 0.0459 - accuracy: 0.9804 - val_loss: 0.2022 - val_accuracy: 0.9462
Epoch 25/50
1888/1888 [==============================] - 7s 4ms/step - loss: 0.0442 - accuracy: 0.9817 - val_loss: 0.2309 - val_accuracy: 0.9435
Epoch 26/50
1888/1888 [==============================] - 7s 4ms/step - loss: 0.0416 - accuracy: 0.9821 - val_loss: 0.2103 - val_accuracy: 0.9436
Epoch 27/50
1888/1888 [==============================] - 5s 3ms/step - loss: 0.0416 - accuracy: 0.9821 - val_loss: 0.2270 - val_accuracy: 0.9443
Epoch 28/50
1888/1888 [==============================] - 7s 4ms/step - loss: 0.0404 - accuracy: 0.9834 - val_loss: 0.2216 - val_accuracy: 0.9415
Epoch 29/50
1888/1888 [==============================] - 5s 3ms/step - loss: 0.0374 - accuracy: 0.9843 - val_loss: 0.2329 - val_accuracy: 0.9417
Epoch 30/50
1888/1888 [==============================] - 6s 3ms/step - loss: 0.0383 - accuracy: 0.9845 - val_loss: 0.2293 - val_accuracy: 0.9402
Epoch 31/50
1888/1888 [==============================] - 7s 4ms/step - loss: 0.0345 - accuracy: 0.9857 - val_loss: 0.2473 - val_accuracy: 0.9447
Epoch 32/50
1888/1888 [==============================] - 6s 3ms/step - loss: 0.0345 - accuracy: 0.9857 - val_loss: 0.2444 - val_accuracy: 0.9425
Epoch 33/50
1888/1888 [==============================] - 6s 3ms/step - loss: 0.0337 - accuracy: 0.9861 - val_loss: 0.2411 - val_accuracy: 0.9437
Epoch 34/50
1888/1888 [==============================] - 5s 3ms/step - loss: 0.0316 - accuracy: 0.9870 - val_loss: 0.2432 - val_accuracy: 0.9398
Epoch 35/50
1888/1888 [==============================] - 8s 4ms/step - loss: 0.0304 - accuracy: 0.9876 - val_loss: 0.2674 - val_accuracy: 0.9429
Epoch 36/50
1888/1888 [==============================] - 5s 3ms/step - loss: 0.0293 - accuracy: 0.9884 - val_loss: 0.2663 - val_accuracy: 0.9399
Epoch 37/50
1888/1888 [==============================] - 7s 4ms/step - loss: 0.0295 - accuracy: 0.9885 - val_loss: 0.2730 - val_accuracy: 0.9418
Epoch 38/50
1888/1888 [==============================] - 6s 3ms/step - loss: 0.0277 - accuracy: 0.9886 - val_loss: 0.2880 - val_accuracy: 0.9426
Epoch 39/50
1888/1888 [==============================] - 8s 4ms/step - loss: 0.0253 - accuracy: 0.9902 - val_loss: 0.2823 - val_accuracy: 0.9417
Epoch 40/50
1888/1888 [==============================] - 5s 3ms/step - loss: 0.0259 - accuracy: 0.9897 - val_loss: 0.3068 - val_accuracy: 0.9401
Epoch 41/50
1888/1888 [==============================] - 8s 4ms/step - loss: 0.0258 - accuracy: 0.9898 - val_loss: 0.3092 - val_accuracy: 0.9424
Epoch 42/50
1888/1888 [==============================] - 6s 3ms/step - loss: 0.0241 - accuracy: 0.9905 - val_loss: 0.3240 - val_accuracy: 0.9399
Epoch 43/50
1888/1888 [==============================] - 7s 4ms/step - loss: 0.0259 - accuracy: 0.9901 - val_loss: 0.2885 - val_accuracy: 0.9416
Epoch 44/50
1888/1888 [==============================] - 6s 3ms/step - loss: 0.0226 - accuracy: 0.9913 - val_loss: 0.3070 - val_accuracy: 0.9409
Epoch 45/50
1888/1888 [==============================] - 7s 4ms/step - loss: 0.0221 - accuracy: 0.9918 - val_loss: 0.3312 - val_accuracy: 0.9415
Epoch 46/50
1888/1888 [==============================] - 5s 3ms/step - loss: 0.0211 - accuracy: 0.9918 - val_loss: 0.3351 - val_accuracy: 0.9423
Epoch 47/50
1888/1888 [==============================] - 6s 3ms/step - loss: 0.0221 - accuracy: 0.9911 - val_loss: 0.3463 - val_accuracy: 0.9434
Epoch 48/50
1888/1888 [==============================] - 6s 3ms/step - loss: 0.0225 - accuracy: 0.9917 - val_loss: 0.3393 - val_accuracy: 0.9401
Epoch 49/50
1888/1888 [==============================] - 6s 3ms/step - loss: 0.0204 - accuracy: 0.9922 - val_loss: 0.3236 - val_accuracy: 0.9394
Epoch 50/50
1888/1888 [==============================] - 7s 4ms/step - loss: 0.0204 - accuracy: 0.9924 - val_loss: 0.3434 - val_accuracy: 0.9418
In [ ]:
#Plotting Train Loss vs Validation Loss
plt.plot(history_2.history['loss'])
plt.plot(history_2.history['val_loss'])
plt.title('model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()

Observations:

  • From the chart above, the validation loss rises as the number of epochs increases (from about 0.19 at epoch 21 to about 0.34 at epoch 50), while the training loss keeps falling.
  • This suggests that with more epochs, the model overfits the training data more and more.
In [ ]:
#Plotting training accuracy vs validation accuracy per epoch
plt.plot(history_2.history['accuracy'])
plt.plot(history_2.history['val_accuracy'])
plt.title('Accuracy vs Epochs')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='lower right')
plt.show()

Observations:

  • The training accuracy keeps rising and reaches about 0.99 by the final epoch.
  • Meanwhile, the accuracy of the model on the validation data stays around 0.94.
  • This gap is a sign of overfitting, and could be due to the increased number of hidden layers.
In [ ]:
# predict probabilities
yhat2 = model_2.predict(x_test_scaled)
# keep probabilities for the positive outcome only
yhat2 = yhat2[:, 0]
# calculate roc curves
fpr, tpr, thresholds2 = roc_curve(y_test, yhat2)
# calculate the g-mean for each threshold
gmeans2 = np.sqrt(tpr * (1-fpr))
# locate the index of the largest g-mean
ix = np.argmax(gmeans2)
print('Best Threshold=%f, G-Mean=%.3f' % (thresholds2[ix], gmeans2[ix]))
# plot the roc curve for the model
pyplot.plot([0,1], [0,1], linestyle='--', label='No Skill')
pyplot.plot(fpr, tpr, marker='.')
pyplot.scatter(fpr[ix], tpr[ix], marker='o', color='black', label='Best')
# axis labels
pyplot.xlabel('False Positive Rate')
pyplot.ylabel('True Positive Rate')
pyplot.legend()
# show the plot
pyplot.show()
590/590 [==============================] - 1s 1ms/step
Best Threshold=0.767428, G-Mean=0.945
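
The threshold search above can be reproduced by hand on a small example. The sketch below is a pure-Python illustration rather than the notebook's own code: the labels and scores are made up, and `best_gmean_threshold` is a hypothetical helper. For each candidate threshold it computes the true-positive rate (TPR) and false-positive rate (FPR), then keeps the threshold with the largest G-mean, sqrt(TPR × (1 − FPR)).

```python
import math

def best_gmean_threshold(y_true, y_score):
    """Return (threshold, g_mean) maximizing sqrt(TPR * (1 - FPR)).

    A sample is predicted positive when score >= threshold.
    """
    positives = sum(1 for y in y_true if y == 1)
    negatives = len(y_true) - positives
    best_threshold, best_g_mean = None, -1.0
    # Every distinct score is a candidate threshold.
    for thr in sorted(set(y_score), reverse=True):
        tp = sum(1 for y, s in zip(y_true, y_score) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, y_score) if s >= thr and y == 0)
        tpr = tp / positives
        fpr = fp / negatives
        g_mean = math.sqrt(tpr * (1 - fpr))
        if g_mean > best_g_mean:
            best_threshold, best_g_mean = thr, g_mean
    return best_threshold, best_g_mean

# Hypothetical toy data (not taken from the survey dataset).
toy_labels = [0, 0, 0, 1, 0, 1, 1, 1, 1]
toy_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

threshold, g_mean = best_gmean_threshold(toy_labels, toy_scores)
print(f"Best Threshold={threshold:.2f}, G-Mean={g_mean:.3f}")
```

On this toy data the best threshold is 0.6: it catches four of the five positives without any false positive. Lowering the threshold to 0.4 would catch the last positive but also admit one negative, which lowers the G-mean.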
In [ ]:
#Predicting the results using the best G-mean threshold
y_pred_e2=model_2.predict(x_test_scaled)
y_pred_e2 = (y_pred_e2 > thresholds2[ix])
y_pred_e2
590/590 [==============================] - 1s 2ms/step
Out[ ]:
array([[ True],
       [ True],
       [ True],
       ...,
       [False],
       [ True],
       [False]])
In [ ]:
metrics_score(y_test, y_pred_e2)
              precision    recall  f1-score   support

           0       0.92      0.96      0.94      8562
           1       0.96      0.93      0.95     10314

    accuracy                           0.94     18876
   macro avg       0.94      0.95      0.94     18876
weighted avg       0.94      0.94      0.94     18876

Observations:

  • The overall accuracy on the test data is 0.94, lower than that of the first model. This is likely due to the increased number of layers, which caused overfitting.

In the next model, we will reduce the number of layers, but utilize batch normalization.

Model 3 - Batch Normalization

Difference from model 2:

  • Batch Normalization between layers
  • Number of layers reduced back to that of model 1
  • Optimizer is still Adam
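
To make the first bullet concrete, here is a minimal sketch of what a batch normalization layer computes during training. It is a pure-Python illustration with made-up activations, not the Keras implementation: `batch_norm_forward` is a hypothetical helper, and the `eps` value of 1e-3 is chosen to match the Keras default. Each unit's output is centered and scaled over the batch, then multiplied by a learned `gamma` and shifted by a learned `beta`.

```python
import math

def batch_norm_forward(batch, gamma, beta, eps=1e-3):
    """Normalize each unit (column) over the batch, then scale and shift."""
    n_rows = len(batch)
    n_units = len(batch[0])
    out = [[0.0] * n_units for _ in range(n_rows)]
    for j in range(n_units):
        column = [row[j] for row in batch]
        mean = sum(column) / n_rows
        var = sum((x - mean) ** 2 for x in column) / n_rows  # biased variance
        for i in range(n_rows):
            x_hat = (batch[i][j] - mean) / math.sqrt(var + eps)
            out[i][j] = gamma[j] * x_hat + beta[j]
    return out

# Hypothetical activations: a batch of 4 samples and 2 units on very different scales.
activations = [[1.0, 200.0], [2.0, 220.0], [3.0, 180.0], [4.0, 200.0]]
normalized = batch_norm_forward(activations, gamma=[1.0, 1.0], beta=[0.0, 0.0])
for row in normalized:
    print([round(v, 3) for v in row])
```

After normalization both units have roughly zero mean and unit variance, so the next Dense layer sees inputs on a comparable scale. At inference time a real layer uses moving averages of the mean and variance instead of the statistics of the current batch.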
In [ ]:
#Clearing the backend resources
backend.clear_session()

#Fixing the randomness
np.random.seed(42)
import random
random.seed(42)
tf.random.set_seed(42)
In [ ]:
#Creating the 3rd model
model_3 = Sequential()

#Adding hidden and output layers
model_3.add(Dense(128,activation='relu',input_dim = x_train_scaled.shape[1]))
model_3.add(BatchNormalization())
model_3.add(Dense(64,activation='relu',kernel_initializer='he_uniform'))
model_3.add(BatchNormalization())
model_3.add(Dense(32,activation='relu',kernel_initializer='he_uniform'))
model_3.add(Dense(1, activation = 'sigmoid'))
In [ ]:
model_3.summary()
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense (Dense)               (None, 128)               3072      
                                                                 
 batch_normalization (BatchN  (None, 128)              512       
 ormalization)                                                   
                                                                 
 dense_1 (Dense)             (None, 64)                8256      
                                                                 
 batch_normalization_1 (Batc  (None, 64)               256       
 hNormalization)                                                 
                                                                 
 dense_2 (Dense)             (None, 32)                2080      
                                                                 
 dense_3 (Dense)             (None, 1)                 33        
                                                                 
=================================================================
Total params: 14,209
Trainable params: 13,825
Non-trainable params: 384
_________________________________________________________________
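
The parameter counts in the summary can be checked by hand. The sketch below assumes 23 input features, a value inferred from the first layer's 3,072 parameters ((23 + 1) × 128) rather than stated in the notebook; `dense_params` and `batch_norm_params` are hypothetical helpers. A Dense layer has one weight per input-unit pair plus one bias per unit. A batch normalization layer has four values per unit: a trainable scale and shift, and a non-trainable moving mean and variance.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

def batch_norm_params(n_units):
    """Return (trainable, non_trainable) parameter counts.

    gamma and beta are trainable; the moving mean and variance are not.
    """
    return 2 * n_units, 2 * n_units

n_features = 23  # assumption: inferred from 3072 = (n_features + 1) * 128

trainable = 0
non_trainable = 0
n_in = n_features
# (units, followed_by_batch_norm) for each Dense layer of model 3
for units, has_bn in [(128, True), (64, True), (32, False), (1, False)]:
    trainable += dense_params(n_in, units)
    if has_bn:
        bn_trainable, bn_frozen = batch_norm_params(units)
        trainable += bn_trainable
        non_trainable += bn_frozen
    n_in = units

total = trainable + non_trainable
print(f"Total: {total}, trainable: {trainable}, non-trainable: {non_trainable}")
```

The result agrees with the summary: 14,209 parameters in total, of which 384 (the moving statistics of the two batch normalization layers) are not trainable.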
In [ ]:
optimizer = tf.keras.optimizers.Adam(0.001)
model_3.compile(loss='binary_crossentropy',
                optimizer=optimizer,
                metrics=['accuracy'])
In [ ]:
history_3 = model_3.fit(x_train_scaled,
                        y_train,
                        batch_size=64,
                        epochs=50,
                        verbose=1,
                        validation_split = 0.2)
Epoch 1/50
944/944 [==============================] - 5s 4ms/step - loss: 0.2380 - accuracy: 0.9004 - val_loss: 0.1738 - val_accuracy: 0.9266
Epoch 2/50
944/944 [==============================] - 4s 5ms/step - loss: 0.1710 - accuracy: 0.9271 - val_loss: 0.1496 - val_accuracy: 0.9382
Epoch 3/50
944/944 [==============================] - 3s 4ms/step - loss: 0.1503 - accuracy: 0.9366 - val_loss: 0.1365 - val_accuracy: 0.9407
Epoch 4/50
944/944 [==============================] - 3s 3ms/step - loss: 0.1386 - accuracy: 0.9410 - val_loss: 0.1341 - val_accuracy: 0.9431
Epoch 5/50
944/944 [==============================] - 3s 3ms/step - loss: 0.1313 - accuracy: 0.9437 - val_loss: 0.1289 - val_accuracy: 0.9454
Epoch 6/50
944/944 [==============================] - 4s 5ms/step - loss: 0.1261 - accuracy: 0.9463 - val_loss: 0.1294 - val_accuracy: 0.9430
Epoch 7/50
944/944 [==============================] - 3s 4ms/step - loss: 0.1199 - accuracy: 0.9480 - val_loss: 0.1248 - val_accuracy: 0.9474
Epoch 8/50
944/944 [==============================] - 3s 3ms/step - loss: 0.1162 - accuracy: 0.9501 - val_loss: 0.1214 - val_accuracy: 0.9480
Epoch 9/50
944/944 [==============================] - 3s 3ms/step - loss: 0.1134 - accuracy: 0.9517 - val_loss: 0.1256 - val_accuracy: 0.9450
Epoch 10/50
944/944 [==============================] - 5s 5ms/step - loss: 0.1095 - accuracy: 0.9527 - val_loss: 0.1227 - val_accuracy: 0.9475
Epoch 11/50
944/944 [==============================] - 3s 3ms/step - loss: 0.1068 - accuracy: 0.9541 - val_loss: 0.1229 - val_accuracy: 0.9476
Epoch 12/50
944/944 [==============================] - 3s 3ms/step - loss: 0.1031 - accuracy: 0.9555 - val_loss: 0.1217 - val_accuracy: 0.9488
Epoch 13/50
944/944 [==============================] - 3s 3ms/step - loss: 0.1018 - accuracy: 0.9561 - val_loss: 0.1202 - val_accuracy: 0.9493
Epoch 14/50
944/944 [==============================] - 4s 5ms/step - loss: 0.1022 - accuracy: 0.9555 - val_loss: 0.1240 - val_accuracy: 0.9472
Epoch 15/50
944/944 [==============================] - 3s 3ms/step - loss: 0.0988 - accuracy: 0.9573 - val_loss: 0.1204 - val_accuracy: 0.9494
Epoch 16/50
944/944 [==============================] - 3s 3ms/step - loss: 0.0974 - accuracy: 0.9585 - val_loss: 0.1205 - val_accuracy: 0.9503
Epoch 17/50
944/944 [==============================] - 4s 4ms/step - loss: 0.0944 - accuracy: 0.9598 - val_loss: 0.1252 - val_accuracy: 0.9470
Epoch 18/50
944/944 [==============================] - 5s 5ms/step - loss: 0.0941 - accuracy: 0.9593 - val_loss: 0.1274 - val_accuracy: 0.9497
Epoch 19/50
944/944 [==============================] - 3s 3ms/step - loss: 0.0921 - accuracy: 0.9606 - val_loss: 0.1260 - val_accuracy: 0.9464
Epoch 20/50
944/944 [==============================] - 3s 3ms/step - loss: 0.0909 - accuracy: 0.9608 - val_loss: 0.1257 - val_accuracy: 0.9471
Epoch 21/50
944/944 [==============================] - 3s 3ms/step - loss: 0.0897 - accuracy: 0.9618 - val_loss: 0.1219 - val_accuracy: 0.9497
Epoch 22/50
944/944 [==============================] - 4s 4ms/step - loss: 0.0897 - accuracy: 0.9610 - val_loss: 0.1244 - val_accuracy: 0.9452
Epoch 23/50
944/944 [==============================] - 3s 3ms/step - loss: 0.0876 - accuracy: 0.9621 - val_loss: 0.1272 - val_accuracy: 0.9478
Epoch 24/50
944/944 [==============================] - 3s 3ms/step - loss: 0.0892 - accuracy: 0.9612 - val_loss: 0.1255 - val_accuracy: 0.9459
Epoch 25/50
944/944 [==============================] - 3s 3ms/step - loss: 0.0848 - accuracy: 0.9644 - val_loss: 0.1231 - val_accuracy: 0.9513
Epoch 26/50
944/944 [==============================] - 5s 5ms/step - loss: 0.0860 - accuracy: 0.9630 - val_loss: 0.1301 - val_accuracy: 0.9468
Epoch 27/50
944/944 [==============================] - 3s 3ms/step - loss: 0.0843 - accuracy: 0.9642 - val_loss: 0.1273 - val_accuracy: 0.9499
Epoch 28/50
944/944 [==============================] - 3s 3ms/step - loss: 0.0816 - accuracy: 0.9652 - val_loss: 0.1282 - val_accuracy: 0.9478
Epoch 29/50
944/944 [==============================] - 4s 4ms/step - loss: 0.0820 - accuracy: 0.9645 - val_loss: 0.1257 - val_accuracy: 0.9464
Epoch 30/50
944/944 [==============================] - 4s 4ms/step - loss: 0.0808 - accuracy: 0.9652 - val_loss: 0.1315 - val_accuracy: 0.9452
Epoch 31/50
944/944 [==============================] - 3s 3ms/step - loss: 0.0801 - accuracy: 0.9653 - val_loss: 0.1294 - val_accuracy: 0.9485
Epoch 32/50
944/944 [==============================] - 3s 3ms/step - loss: 0.0785 - accuracy: 0.9659 - val_loss: 0.1326 - val_accuracy: 0.9475
Epoch 33/50
944/944 [==============================] - 4s 4ms/step - loss: 0.0784 - accuracy: 0.9661 - val_loss: 0.1306 - val_accuracy: 0.9483
Epoch 34/50
944/944 [==============================] - 4s 4ms/step - loss: 0.0761 - accuracy: 0.9675 - val_loss: 0.1292 - val_accuracy: 0.9476
Epoch 35/50
944/944 [==============================] - 3s 3ms/step - loss: 0.0755 - accuracy: 0.9678 - val_loss: 0.1297 - val_accuracy: 0.9492
Epoch 36/50
944/944 [==============================] - 3s 3ms/step - loss: 0.0757 - accuracy: 0.9675 - val_loss: 0.1371 - val_accuracy: 0.9470
Epoch 37/50
944/944 [==============================] - 3s 4ms/step - loss: 0.0756 - accuracy: 0.9674 - val_loss: 0.1417 - val_accuracy: 0.9460
Epoch 38/50
944/944 [==============================] - 4s 4ms/step - loss: 0.0744 - accuracy: 0.9684 - val_loss: 0.1365 - val_accuracy: 0.9448
Epoch 39/50
944/944 [==============================] - 3s 3ms/step - loss: 0.0747 - accuracy: 0.9690 - val_loss: 0.1333 - val_accuracy: 0.9455
Epoch 40/50
944/944 [==============================] - 3s 3ms/step - loss: 0.0726 - accuracy: 0.9680 - val_loss: 0.1408 - val_accuracy: 0.9457
Epoch 41/50
944/944 [==============================] - 3s 4ms/step - loss: 0.0714 - accuracy: 0.9696 - val_loss: 0.1393 - val_accuracy: 0.9450
Epoch 42/50
944/944 [==============================] - 4s 4ms/step - loss: 0.0714 - accuracy: 0.9697 - val_loss: 0.1374 - val_accuracy: 0.9448
Epoch 43/50
944/944 [==============================] - 3s 3ms/step - loss: 0.0696 - accuracy: 0.9707 - val_loss: 0.1382 - val_accuracy: 0.9473
Epoch 44/50
944/944 [==============================] - 3s 4ms/step - loss: 0.0718 - accuracy: 0.9704 - val_loss: 0.1406 - val_accuracy: 0.9457
Epoch 45/50
944/944 [==============================] - 4s 4ms/step - loss: 0.0697 - accuracy: 0.9702 - val_loss: 0.1384 - val_accuracy: 0.9473
Epoch 46/50
944/944 [==============================] - 3s 4ms/step - loss: 0.0691 - accuracy: 0.9707 - val_loss: 0.1388 - val_accuracy: 0.9469
Epoch 47/50
944/944 [==============================] - 3s 3ms/step - loss: 0.0683 - accuracy: 0.9709 - val_loss: 0.1464 - val_accuracy: 0.9460
Epoch 48/50
944/944 [==============================] - 3s 3ms/step - loss: 0.0679 - accuracy: 0.9712 - val_loss: 0.1375 - val_accuracy: 0.9445
Epoch 49/50
944/944 [==============================] - 4s 4ms/step - loss: 0.0679 - accuracy: 0.9712 - val_loss: 0.1463 - val_accuracy: 0.9466
Epoch 50/50
944/944 [==============================] - 3s 4ms/step - loss: 0.0670 - accuracy: 0.9718 - val_loss: 0.1434 - val_accuracy: 0.9462
In [ ]:
#Plotting Train Loss vs Validation Loss
plt.plot(history_3.history['loss'])
plt.plot(history_3.history['val_loss'])
plt.title('model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()
In [ ]:
#Plotting training accuracy vs validation accuracy per epoch
plt.plot(history_3.history['accuracy'])
plt.plot(history_3.history['val_accuracy'])
plt.title('Accuracy vs Epochs')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='lower right')
plt.show()

Observations:

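  • After about 5 epochs the training loss drops below the validation loss. The validation loss bottoms out near 0.120 around epoch 13 and then drifts up to about 0.143 by epoch 50, so the model increasingly overfits the training data.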
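
One common response to this pattern is to stop training once the validation loss stops improving. The sketch below simulates that rule on the validation losses recorded for the first 20 epochs of model 3. It is a pure-Python illustration of the idea behind the Keras `EarlyStopping` callback, not a call to it: `early_stopping_epoch` is a hypothetical helper, and the patience of 5 epochs is an arbitrary choice.

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    Training stops once `patience` consecutive epochs fail to improve on
    the best validation loss seen so far.
    """
    best_loss = float("inf")
    best_epoch = 0
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

# val_loss for epochs 1-20 of model 3, copied from the training log above.
model_3_val_loss = [
    0.1738, 0.1496, 0.1365, 0.1341, 0.1289, 0.1294, 0.1248, 0.1214,
    0.1256, 0.1227, 0.1229, 0.1217, 0.1202, 0.1240, 0.1204, 0.1205,
    0.1252, 0.1274, 0.1260, 0.1257,
]

stop_epoch, best_epoch = early_stopping_epoch(model_3_val_loss, patience=5)
print(f"stop after epoch {stop_epoch}, best epoch {best_epoch}")
```

With this rule, training would have stopped after epoch 18 and kept the weights from epoch 13, instead of running all 50 epochs.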
In [ ]:
# predict probabilities
yhat3 = model_3.predict(x_test_scaled)
# keep probabilities for the positive outcome only
yhat3 = yhat3[:, 0]
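# calculate roc curves from model 3's own probabilities
# (the output recorded below was produced with yhat1, i.e. model 1's probabilities)
fpr, tpr, thresholds3 = roc_curve(y_test, yhat3)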
# calculate the g-mean for each threshold
gmeans3 = np.sqrt(tpr * (1-fpr))
# locate the index of the largest g-mean
ix = np.argmax(gmeans3)
print('Best Threshold=%f, G-Mean=%.3f' % (thresholds3[ix], gmeans3[ix]))
# plot the roc curve for the model
pyplot.plot([0,1], [0,1], linestyle='--', label='No Skill')
pyplot.plot(fpr, tpr, marker='.')
pyplot.scatter(fpr[ix], tpr[ix], marker='o', color='black', label='Best')
# axis labels
pyplot.xlabel('False Positive Rate')
pyplot.ylabel('True Positive Rate')
pyplot.legend()
# show the plot
pyplot.show()
590/590 [==============================] - 1s 2ms/step
Best Threshold=0.590928, G-Mean=0.946
In [ ]:
#Predicting the results using the best G-mean threshold
y_pred_e3=model_3.predict(x_test_scaled)
y_pred_e3 = (y_pred_e3 > thresholds3[ix])
y_pred_e3
590/590 [==============================] - 1s 2ms/step
Out[ ]:
array([[ True],
       [ True],
       [ True],
       ...,
       [False],
       [ True],
       [False]])
In [ ]:
metrics_score(y_test, y_pred_e3)
              precision    recall  f1-score   support

           0       0.92      0.96      0.94      8562
           1       0.97      0.93      0.95     10314

    accuracy                           0.95     18876
   macro avg       0.95      0.95      0.95     18876
weighted avg       0.95      0.95      0.95     18876

Observations:

  • This model performs better than models 1 and 2, which suggests that the previous model had too many hidden layers and overfit as a result.
  • In the next model, we will add dropout alongside batch normalization.

Model 4 - Dropout with Batch Normalization

In [ ]:
#Clearing the backend resources
backend.clear_session()

#Fixing the randomness
np.random.seed(42)
import random
random.seed(42)
tf.random.set_seed(42)
In [ ]:
model_4 = Sequential()
#Adding the hidden and output layers
model_4.add(Dense(128,activation='relu',kernel_initializer='he_uniform',input_dim = x_train_scaled.shape[1]))
model_4.add(Dropout(0.2))
model_4.add(BatchNormalization())
model_4.add(Dense(64,activation='relu',kernel_initializer='he_uniform'))
model_4.add(Dropout(0.2))
model_4.add(BatchNormalization())
# model_4.add(Dense(32,activation='relu',kernel_initializer='he_uniform'))
# model_4.add(Dropout(0.2))
# model_4.add(BatchNormalization())
model_4.add(Dense(1, activation = 'sigmoid'))
#Compiling the ANN with Adam optimizer and binary cross entropy loss function
optimizer = tf.keras.optimizers.Adam(0.001)
model_4.compile(loss='binary_crossentropy',optimizer=optimizer,metrics=['accuracy'])
In [ ]:
model_4.summary()
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense (Dense)               (None, 128)               3072      
                                                                 
 dropout (Dropout)           (None, 128)               0         
                                                                 
 batch_normalization (BatchN  (None, 128)              512       
 ormalization)                                                   
                                                                 
 dense_1 (Dense)             (None, 64)                8256      
                                                                 
 dropout_1 (Dropout)         (None, 64)                0         
                                                                 
 batch_normalization_1 (Batc  (None, 64)               256       
 hNormalization)                                                 
                                                                 
 dense_2 (Dense)             (None, 1)                 65        
                                                                 
=================================================================
Total params: 12,161
Trainable params: 11,777
Non-trainable params: 384
_________________________________________________________________
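
The `Dropout(0.2)` layers add no parameters, which is why they show 0 in the summary. The sketch below shows the usual "inverted dropout" scheme they are based on. It is a pure-Python illustration with made-up activations, not the Keras code: `dropout_forward` is a hypothetical helper. During training, each unit is zeroed with probability 0.2 and the surviving units are scaled by 1 / 0.8 so that the expected activation is unchanged.

```python
import random

def dropout_forward(activations, rate, rng):
    """Inverted dropout: zero each unit with probability `rate` and
    scale the survivors by 1 / (1 - rate)."""
    keep_prob = 1.0 - rate
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]

rng = random.Random(42)        # fixed seed so the sketch is reproducible
activations = [1.0] * 10_000   # hypothetical layer output
dropped = dropout_forward(activations, rate=0.2, rng=rng)

zero_fraction = dropped.count(0.0) / len(dropped)
mean_activation = sum(dropped) / len(dropped)
print(f"zeroed: {zero_fraction:.3f}, mean after dropout: {mean_activation:.3f}")
```

About 20% of the units are zeroed while the mean activation stays close to 1. Dropout is only active during training, which probably explains why model 4's training accuracy sits slightly below its validation accuracy in the log that follows.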
In [ ]:
optimizer = tf.keras.optimizers.Adam(0.001)
model_4.compile(loss='binary_crossentropy',
                optimizer=optimizer,
                metrics=['accuracy'])
In [ ]:
history_4 = model_4.fit(x_train_scaled,
                        y_train,
                        batch_size=64,
                        epochs=50,
                        verbose=1,
                        validation_split = 0.2)
Epoch 1/50
944/944 [==============================] - 7s 5ms/step - loss: 0.3214 - accuracy: 0.8634 - val_loss: 0.2218 - val_accuracy: 0.9074
Epoch 2/50
944/944 [==============================] - 3s 3ms/step - loss: 0.2523 - accuracy: 0.8944 - val_loss: 0.1983 - val_accuracy: 0.9148
Epoch 3/50
944/944 [==============================] - 4s 4ms/step - loss: 0.2284 - accuracy: 0.9044 - val_loss: 0.1785 - val_accuracy: 0.9242
Epoch 4/50
944/944 [==============================] - 5s 6ms/step - loss: 0.2115 - accuracy: 0.9101 - val_loss: 0.1633 - val_accuracy: 0.9325
Epoch 5/50
944/944 [==============================] - 4s 4ms/step - loss: 0.1977 - accuracy: 0.9172 - val_loss: 0.1552 - val_accuracy: 0.9354
Epoch 6/50
944/944 [==============================] - 3s 4ms/step - loss: 0.1895 - accuracy: 0.9211 - val_loss: 0.1480 - val_accuracy: 0.9372
Epoch 7/50
944/944 [==============================] - 4s 4ms/step - loss: 0.1799 - accuracy: 0.9248 - val_loss: 0.1428 - val_accuracy: 0.9395
Epoch 8/50
944/944 [==============================] - 5s 5ms/step - loss: 0.1728 - accuracy: 0.9275 - val_loss: 0.1399 - val_accuracy: 0.9415
Epoch 9/50
944/944 [==============================] - 3s 3ms/step - loss: 0.1678 - accuracy: 0.9298 - val_loss: 0.1343 - val_accuracy: 0.9438
Epoch 10/50
944/944 [==============================] - 3s 3ms/step - loss: 0.1631 - accuracy: 0.9312 - val_loss: 0.1336 - val_accuracy: 0.9440
Epoch 11/50
944/944 [==============================] - 4s 5ms/step - loss: 0.1596 - accuracy: 0.9323 - val_loss: 0.1308 - val_accuracy: 0.9455
Epoch 12/50
944/944 [==============================] - 3s 3ms/step - loss: 0.1547 - accuracy: 0.9351 - val_loss: 0.1276 - val_accuracy: 0.9454
Epoch 13/50
944/944 [==============================] - 3s 4ms/step - loss: 0.1508 - accuracy: 0.9366 - val_loss: 0.1288 - val_accuracy: 0.9481
Epoch 14/50
944/944 [==============================] - 3s 3ms/step - loss: 0.1517 - accuracy: 0.9367 - val_loss: 0.1274 - val_accuracy: 0.9473
Epoch 15/50
944/944 [==============================] - 4s 4ms/step - loss: 0.1478 - accuracy: 0.9381 - val_loss: 0.1241 - val_accuracy: 0.9479
Epoch 16/50
944/944 [==============================] - 3s 3ms/step - loss: 0.1468 - accuracy: 0.9375 - val_loss: 0.1241 - val_accuracy: 0.9480
Epoch 17/50
944/944 [==============================] - 3s 3ms/step - loss: 0.1424 - accuracy: 0.9385 - val_loss: 0.1243 - val_accuracy: 0.9479
Epoch 18/50
944/944 [==============================] - 3s 3ms/step - loss: 0.1417 - accuracy: 0.9402 - val_loss: 0.1207 - val_accuracy: 0.9497
Epoch 19/50
944/944 [==============================] - 4s 5ms/step - loss: 0.1400 - accuracy: 0.9405 - val_loss: 0.1210 - val_accuracy: 0.9479
Epoch 20/50
944/944 [==============================] - 3s 3ms/step - loss: 0.1384 - accuracy: 0.9409 - val_loss: 0.1226 - val_accuracy: 0.9470
Epoch 21/50
944/944 [==============================] - 3s 3ms/step - loss: 0.1380 - accuracy: 0.9408 - val_loss: 0.1187 - val_accuracy: 0.9507
Epoch 22/50
944/944 [==============================] - 3s 3ms/step - loss: 0.1378 - accuracy: 0.9411 - val_loss: 0.1181 - val_accuracy: 0.9492
Epoch 23/50
944/944 [==============================] - 5s 5ms/step - loss: 0.1352 - accuracy: 0.9415 - val_loss: 0.1198 - val_accuracy: 0.9511
Epoch 24/50
944/944 [==============================] - 3s 3ms/step - loss: 0.1341 - accuracy: 0.9423 - val_loss: 0.1169 - val_accuracy: 0.9485
Epoch 25/50
944/944 [==============================] - 3s 3ms/step - loss: 0.1305 - accuracy: 0.9434 - val_loss: 0.1172 - val_accuracy: 0.9490
Epoch 26/50
944/944 [==============================] - 4s 4ms/step - loss: 0.1318 - accuracy: 0.9434 - val_loss: 0.1141 - val_accuracy: 0.9502
Epoch 27/50
944/944 [==============================] - 4s 4ms/step - loss: 0.1312 - accuracy: 0.9429 - val_loss: 0.1179 - val_accuracy: 0.9511
Epoch 28/50
944/944 [==============================] - 3s 3ms/step - loss: 0.1277 - accuracy: 0.9451 - val_loss: 0.1138 - val_accuracy: 0.9524
Epoch 29/50
944/944 [==============================] - 3s 3ms/step - loss: 0.1305 - accuracy: 0.9439 - val_loss: 0.1133 - val_accuracy: 0.9515
Epoch 30/50
944/944 [==============================] - 3s 3ms/step - loss: 0.1276 - accuracy: 0.9454 - val_loss: 0.1131 - val_accuracy: 0.9517
Epoch 31/50
944/944 [==============================] - 4s 5ms/step - loss: 0.1265 - accuracy: 0.9454 - val_loss: 0.1111 - val_accuracy: 0.9526
Epoch 32/50
944/944 [==============================] - 3s 3ms/step - loss: 0.1267 - accuracy: 0.9457 - val_loss: 0.1106 - val_accuracy: 0.9536
Epoch 33/50
944/944 [==============================] - 3s 3ms/step - loss: 0.1259 - accuracy: 0.9463 - val_loss: 0.1100 - val_accuracy: 0.9545
Epoch 34/50
944/944 [==============================] - 3s 3ms/step - loss: 0.1257 - accuracy: 0.9463 - val_loss: 0.1110 - val_accuracy: 0.9523
Epoch 35/50
944/944 [==============================] - 4s 4ms/step - loss: 0.1236 - accuracy: 0.9465 - val_loss: 0.1102 - val_accuracy: 0.9530
Epoch 36/50
944/944 [==============================] - 3s 3ms/step - loss: 0.1236 - accuracy: 0.9472 - val_loss: 0.1104 - val_accuracy: 0.9544
Epoch 37/50
944/944 [==============================] - 3s 3ms/step - loss: 0.1213 - accuracy: 0.9470 - val_loss: 0.1096 - val_accuracy: 0.9538
Epoch 38/50
944/944 [==============================] - 3s 4ms/step - loss: 0.1223 - accuracy: 0.9477 - val_loss: 0.1120 - val_accuracy: 0.9509
Epoch 39/50
944/944 [==============================] - 4s 4ms/step - loss: 0.1224 - accuracy: 0.9471 - val_loss: 0.1099 - val_accuracy: 0.9540
Epoch 40/50
944/944 [==============================] - 3s 3ms/step - loss: 0.1218 - accuracy: 0.9479 - val_loss: 0.1086 - val_accuracy: 0.9517
Epoch 41/50
944/944 [==============================] - 3s 3ms/step - loss: 0.1201 - accuracy: 0.9488 - val_loss: 0.1078 - val_accuracy: 0.9531
Epoch 42/50
944/944 [==============================] - 3s 4ms/step - loss: 0.1204 - accuracy: 0.9481 - val_loss: 0.1083 - val_accuracy: 0.9528
Epoch 43/50
944/944 [==============================] - 4s 5ms/step - loss: 0.1201 - accuracy: 0.9484 - val_loss: 0.1081 - val_accuracy: 0.9546
Epoch 44/50
944/944 [==============================] - 3s 4ms/step - loss: 0.1213 - accuracy: 0.9478 - val_loss: 0.1082 - val_accuracy: 0.9542
Epoch 45/50
944/944 [==============================] - 3s 3ms/step - loss: 0.1217 - accuracy: 0.9469 - val_loss: 0.1068 - val_accuracy: 0.9540
Epoch 46/50
944/944 [==============================] - 4s 4ms/step - loss: 0.1187 - accuracy: 0.9482 - val_loss: 0.1080 - val_accuracy: 0.9530
Epoch 47/50
944/944 [==============================] - 4s 4ms/step - loss: 0.1196 - accuracy: 0.9487 - val_loss: 0.1076 - val_accuracy: 0.9540
Epoch 48/50
944/944 [==============================] - 3s 3ms/step - loss: 0.1194 - accuracy: 0.9477 - val_loss: 0.1067 - val_accuracy: 0.9544
Epoch 49/50
944/944 [==============================] - 3s 3ms/step - loss: 0.1172 - accuracy: 0.9490 - val_loss: 0.1069 - val_accuracy: 0.9552
Epoch 50/50
944/944 [==============================] - 4s 4ms/step - loss: 0.1184 - accuracy: 0.9488 - val_loss: 0.1060 - val_accuracy: 0.9539
In [ ]:
#Plotting Train Loss vs Validation Loss
plt.plot(history_4.history['loss'])
plt.plot(history_4.history['val_loss'])
plt.title('model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()
In [ ]:
#Plotting training accuracy vs validation accuracy per epoch
plt.plot(history_4.history['accuracy'])
plt.plot(history_4.history['val_accuracy'])
plt.title('Accuracy vs Epochs')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='lower right')
plt.show()
In [ ]:
# predict probabilities
yhat4 = model_4.predict(x_test_scaled)
# keep probabilities for the positive outcome only
yhat4 = yhat4[:, 0]
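# calculate roc curves from model 4's own probabilities
# (the output recorded below was produced with yhat1, i.e. model 1's probabilities)
fpr, tpr, thresholds4 = roc_curve(y_test, yhat4)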
# calculate the g-mean for each threshold
gmeans4 = np.sqrt(tpr * (1-fpr))
# locate the index of the largest g-mean
ix = np.argmax(gmeans4)
print('Best Threshold=%f, G-Mean=%.3f' % (thresholds4[ix], gmeans4[ix]))
# plot the roc curve for the model
pyplot.plot([0,1], [0,1], linestyle='--', label='No Skill')
pyplot.plot(fpr, tpr, marker='.')
pyplot.scatter(fpr[ix], tpr[ix], marker='o', color='black', label='Best')
# axis labels
pyplot.xlabel('False Positive Rate')
pyplot.ylabel('True Positive Rate')
pyplot.legend()
# show the plot
pyplot.show()
590/590 [==============================] - 1s 2ms/step
Best Threshold=0.590928, G-Mean=0.946
In [ ]:
#Predicting the results using the best G-mean threshold
y_pred_e4=model_4.predict(x_test_scaled)
y_pred_e4 = (y_pred_e4 > thresholds4[ix])
y_pred_e4
590/590 [==============================] - 1s 2ms/step
Out[ ]:
array([[ True],
       [ True],
       [ True],
       ...,
       [False],
       [ True],
       [False]])
In [ ]:
metrics_score(y_test, y_pred_e4)
              precision    recall  f1-score   support

           0       0.92      0.97      0.95      8562
           1       0.98      0.93      0.95     10314

    accuracy                           0.95     18876
   macro avg       0.95      0.95      0.95     18876
weighted avg       0.95      0.95      0.95     18876

Observations:

  • This model performed better than the previous models, even if only a little.

We will therefore use model 4 to predict the unseen data for the hackathon submission.

There are many more ways to tune the model, but due to a time constraint, they were not explored.

Running best model on test data

In [ ]:
#Storing the test data that will be used to make prediction for the hackathon
traveldata_test = pd.read_csv('/content/drive/MyDrive/GreatLearning/Hackathon/Datasets/Traveldata_test.csv')
surveydata_test = pd.read_csv('/content/drive/MyDrive/GreatLearning/Hackathon/Datasets/Surveydata_test.csv')

#merge dataframes
test = pd.merge(traveldata_test,surveydata_test,how='inner',on='ID')
if n_passengers == test['ID'].nunique():
    print('merge is successful, all passengers are in the final dataframe')

#Converting all features with satisfaction scales to numerical variables
for column in appreciation_variables:
    test[column] = test[column].apply(cat_to_numerical)

#Converting Platform_Location to numerical variables
test['Platform_Location'] = test['Platform_Location'].replace({
    'Very Convenient': 5,
    'Convenient': 4,
    'Manageable': 3,
    'Needs Improvement': 2,
    'Inconvenient': 1,
    'Very Inconvenient': 0})

# Creating dummy variables for the categorical columns
test_data = pd.get_dummies(test,
                      columns = data.select_dtypes(include = ["object", "category"]).columns.tolist(),
                      drop_first = True) #Only apply this function to object and category variables

#Store the ID column in a variable first
id_column = test_data['ID']

#Remove the ID column as it is not needed for the scaling
test_data = test_data.drop('ID',axis=1)

#Scale the data with the scaler already fitted on the training data
test_data_scaled=sc.transform(test_data)
test_data_scaled=pd.DataFrame(test_data_scaled, columns=test_data.columns)

#Label the data with the respective ID
final_to_predict = test_data_scaled.join(id_column)

#Print out to see what the scaled data look like
final_to_predict
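
Scaling matters here because the network was trained on features standardized with the training set's mean and standard deviation. The sketch below uses made-up ages and a hypothetical `SimpleStandardScaler` class (a single-feature stand-in, assuming `sc` is a standard scaler fitted on the training features). It compares transforming new data with the training statistics against re-fitting the scaler on the new data.

```python
import statistics

class SimpleStandardScaler:
    """Minimal single-feature stand-in for a standard scaler."""

    def fit(self, values):
        self.mean_ = statistics.fmean(values)
        self.std_ = statistics.pstdev(values)  # population standard deviation
        return self

    def transform(self, values):
        return [(v - self.mean_) / self.std_ for v in values]

# Hypothetical ages: the new batch happens to skew older than the training data.
train_ages = [20.0, 30.0, 40.0, 50.0, 60.0]
test_ages = [50.0, 60.0, 70.0]

scaler = SimpleStandardScaler().fit(train_ages)
train_fitted = scaler.transform(test_ages)  # uses training statistics
re_fitted = SimpleStandardScaler().fit(test_ages).transform(test_ages)

print("train-fitted:", [round(v, 2) for v in train_fitted])
print("re-fitted:   ", [round(v, 2) for v in re_fitted])
```

A 60-year-old passenger maps to about 1.41 with the training statistics but to 0.0 after re-fitting, so the model would receive a different input for the same passenger. Reusing the training-time scaler keeps the inputs on the scale the model was trained on.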
In [ ]:
#Make predictions using model 4, dropping the ID column as it is not needed for the prediction
final_to_predict['predictions']= model_4.predict(final_to_predict.drop('ID',axis=1))

# Apply the best G-mean threshold found on model 4's ROC curve
final_to_predict['predictions'] = final_to_predict['predictions'].apply(lambda x: 1 if x>thresholds4[ix] else 0)

#Save the predicted data into the csv file
final_to_predict[['ID','predictions']].to_csv("predictions.csv",index=False)
1113/1113 [==============================] - 3s 3ms/step
In [ ]:
#Print out the data to see what it looks like
final_to_predict[['ID','predictions']]
Out[ ]:
ID predictions
0 99900001 0
1 99900002 1
2 99900003 1
3 99900004 0
4 99900005 0
... ... ...
35597 99935598 0
35598 99935599 1
35599 99935600 1
35600 99935601 1
35601 99935602 0

35602 rows × 2 columns

The predictions in the final_to_predict dataframe are then submitted to the hackathon for scoring.

My model obtained an accuracy of 80%.